Artificial Intelligence 150
☆ Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective
Classifier-free guidance has become a staple for conditional generation with
denoising diffusion models. However, a comprehensive understanding of
classifier-free guidance is still missing. In this work, we carry out an
empirical study to provide a fresh perspective on classifier-free guidance.
Concretely, instead of solely focusing on classifier-free guidance, we trace
back to the root, i.e., classifier guidance, pinpoint the key assumption for
the derivation, and conduct a systematic study to understand the role of the
classifier. We find that both classifier guidance and classifier-free guidance
achieve conditional generation by pushing the denoising diffusion trajectories
away from decision boundaries, i.e., areas where conditional information is
usually entangled and is hard to learn. Based on this classifier-centric
understanding, we propose a generic postprocessing step built upon
flow-matching to shrink the gap between the learned distribution for a
pre-trained denoising diffusion model and the real data distribution, mainly
around the decision boundaries. Experiments on various datasets verify the
effectiveness of the proposed approach.
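For reference, the textbook relations behind the two guidance schemes can be sketched as follows. This is the standard formulation, not notation taken from the paper: the guidance scale $w$, the noise predictor $\epsilon_\theta$, and the null condition $\varnothing$ are notational assumptions.

```latex
% Bayes' rule applied to the score of the noisy conditional distribution
\nabla_{x_t} \log p(x_t \mid y)
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)

% Classifier guidance: scale the classifier term by w
\nabla_{x_t} \log p_w(x_t \mid y)
  = \nabla_{x_t} \log p(x_t) + w \, \nabla_{x_t} \log p(y \mid x_t)

% Classifier-free guidance: the same idea with an implicit classifier,
% expressed through conditional and unconditional noise predictions
\tilde{\epsilon}_\theta(x_t, y)
  = \epsilon_\theta(x_t, \varnothing)
  + w \left( \epsilon_\theta(x_t, y) - \epsilon_\theta(x_t, \varnothing) \right)
```

Under this convention, $w = 1$ recovers the plain conditional model, and $w > 1$ amplifies the (explicit or implicit) classifier gradient.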
☆ A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Despite promising performance on open-source large vision-language models
(LVLMs), transfer-based targeted attacks often fail against black-box
commercial LVLMs. Analyzing failed adversarial perturbations reveals that the
learned perturbations typically originate from a uniform distribution and lack
clear semantic details, resulting in unintended responses. This critical
absence of semantic information leads commercial LVLMs to either ignore the
perturbation entirely or misinterpret its embedded semantics, thereby causing
the attack to fail. To overcome these issues, we notice that identifying core
semantic objects is a key objective for models trained with various datasets
and methodologies. This insight motivates our approach that refines semantic
clarity by encoding explicit semantic details within local regions, thus
ensuring interoperability and capturing finer-grained features, and by
concentrating modifications on semantically rich areas rather than applying
them uniformly. To achieve this, we propose a simple yet highly effective
solution: at each optimization step, the adversarial image is cropped randomly
by a controlled aspect ratio and scale, resized, and then aligned with the
target image in the embedding space. Experimental results confirm our
hypothesis. Our adversarial examples crafted with local-aggregated
perturbations focused on crucial regions exhibit surprisingly good
transferability to commercial LVLMs, including GPT-4.5, GPT-4o,
Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning
models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach
achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly
outperforming all prior state-of-the-art attack methods. Our optimized
adversarial examples under different configurations and training code are
available at https://github.com/VILA-Lab/M-Attack.
comment: Code at: https://github.com/VILA-Lab/M-Attack
☆ Uncertainty in Action: Confidence Elicitation in Embodied Agents
Tianjiao Yu, Vedant Shah, Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar, Tal August, Ismini Lourentzou
Expressing confidence is challenging for embodied agents navigating dynamic
multimodal environments, where uncertainty arises from both perception and
decision-making processes. We present the first work investigating embodied
confidence elicitation in open-ended multimodal environments. We introduce
Elicitation Policies, which structure confidence assessment across inductive,
deductive, and abductive reasoning, along with Execution Policies, which
enhance confidence calibration through scenario reinterpretation, action
sampling, and hypothetical reasoning. Evaluating agents in calibration and
failure prediction tasks within the Minecraft environment, we show that
structured reasoning approaches, such as Chain-of-Thought, improve confidence
calibration. However, our findings also reveal persistent challenges in
distinguishing uncertainty, particularly under abductive settings, underscoring
the need for more sophisticated embodied confidence elicitation methods.
comment: Project page: https://plan-lab.github.io/ece/
☆ SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their
application in scientific problem-solving, yet their fine-grained capabilities
remain under-explored. In this paper, we introduce SciVerse, a multi-modal
scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test
instances in five distinct versions. We aim to investigate three key dimensions
of LMMs: scientific knowledge comprehension, multi-modal content
interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs
possess sufficient scientific expertise, we first transform each problem into
three versions containing different levels of knowledge required for solving,
i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret
multi-modal scientific content, we annotate another two versions, i.e.,
Vision-rich and -only, marking more question information from texts to
diagrams. Comparing the results of different versions, SciVerse systematically
examines the professional knowledge stock and visual perception skills of LMMs
in scientific domains. In addition, to rigorously assess CoT reasoning, we
propose a new scientific CoT evaluation strategy, conducting a step-wise
assessment on knowledge and logical errors in model outputs. Our extensive
evaluation of different LMMs on SciVerse reveals critical limitations in their
scientific proficiency and provides new insights into future developments.
Project page: https://sciverse-cuhk.github.io
comment: Initially released in September 2024. Project page:
https://sciverse-cuhk.github.io
☆ NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Acquiring physically plausible motor skills across diverse and unconventional
morphologies, including humanoid robots, quadrupeds, and animals, is essential
for advancing character simulation and robotics. Traditional methods, such as
reinforcement learning (RL), are task- and body-specific, require extensive
reward function engineering, and do not generalize well. Imitation learning
offers an alternative but relies heavily on high-quality expert demonstrations,
which are difficult to obtain for non-human morphologies. Video diffusion
models, on the other hand, are capable of generating realistic videos of
various morphologies, from humans to ants. Leveraging this capability, we
propose a data-independent approach for skill acquisition that learns 3D motor
skills from 2D-generated videos, with generalization capability to
unconventional and non-human forms. Specifically, we guide the imitation
learning process by leveraging vision transformers for video-based comparisons
by calculating pair-wise distance between video embeddings. Along with
video-encoding distance, we also use a computed similarity between segmented
video frames as a guidance reward. We validate our method on locomotion tasks
involving unique body configurations. In humanoid robot locomotion tasks, we
demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines
trained on 3D motion-capture data. Our results highlight the potential of
leveraging generative video models for physically plausible skill learning with
diverse morphologies, effectively replacing data collection with data
generation for imitation learning.
☆ LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo
Animatable 3D human reconstruction from a single image is a challenging
problem due to the ambiguity in decoupling geometry, appearance, and
deformation. Recent advances in 3D human reconstruction mainly focus on static
human modeling, and their reliance on synthetic 3D scans for training
limits their generalization ability. Conversely, optimization-based video
methods achieve higher fidelity but demand controlled capture conditions and
computationally intensive refinement processes. Motivated by the emergence of
large reconstruction models for efficient static reconstruction, we propose LHM
(Large Animatable Human Reconstruction Model) to infer high-fidelity avatars
represented as 3D Gaussian splatting in a feed-forward pass. Our model
leverages a multimodal transformer architecture to effectively encode the human
body positional features and image features with an attention mechanism, enabling
detailed preservation of clothing geometry and texture. To further boost the
face identity preservation and fine detail recovery, we propose a head feature
pyramid encoding scheme to aggregate multi-scale features of the head regions.
Extensive experiments demonstrate that our LHM generates plausible animatable
humans in seconds without post-processing for face and hands, outperforming
existing methods in both reconstruction accuracy and generalization ability.
comment: Project Page: https://lingtengqiu.github.io/LHM/
☆ ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Fitting a body to a 3D clothed human point cloud is a common yet challenging
task. Traditional optimization-based approaches use multi-stage pipelines that
are sensitive to pose initialization, while recent learning-based methods often
struggle with generalization across diverse poses and garment types. We propose
Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline
that estimates cloth-to-body surface mapping through locally approximate SE(3)
equivariance, encoding tightness as displacement vectors from the cloth surface
to the underlying body. Following this mapping, pose-invariant body features
regress sparse body markers, simplifying clothed human fitting into an
inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show
that ETCH significantly outperforms state-of-the-art methods -- both
tightness-agnostic and tightness-aware -- in body fitting accuracy on loose
clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant
tightness design can even reduce directional errors by (67.2% ~ 89.8%) in
one-shot (or out-of-distribution) settings. Qualitative results demonstrate
strong generalization of ETCH, regardless of challenging poses, unseen shapes,
loose clothing, and non-rigid dynamics. We will release the code and models
soon for research purposes at https://boqian-li.github.io/ETCH/.
comment: Page: https://boqian-li.github.io/ETCH/, Code:
https://github.com/boqian-li/ETCH
★ Transformers without Normalization CVPR 2025
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
$\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, $S$-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
comment: CVPR 2025; Project page: https://jiachenzhu.github.io/DyT/
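The element-wise operation stated in the abstract can be illustrated with a minimal NumPy sketch. It implements only the formula tanh(alpha * x); here `alpha` is treated as a plain scalar, and the value 0.5 is an arbitrary illustrative choice, not a setting from the paper.

```python
import numpy as np


def dyt(x, alpha=0.5):
    """Element-wise Dynamic Tanh as written in the abstract: tanh(alpha * x).

    alpha is a plain scalar here; 0.5 is an arbitrary illustrative default.
    """
    return np.tanh(alpha * np.asarray(x, dtype=float))


# Large-magnitude inputs are squashed toward -1 or 1, while small inputs
# pass through almost linearly (scaled by alpha).
x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = dyt(x, alpha=0.5)
```

Unlike a normalization layer, this operation needs no statistics computed over the token or the batch; each element is transformed independently.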
☆ Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search ICLR 2025
We introduce Siege, a multi-turn adversarial framework that models the
gradual erosion of Large Language Model (LLM) safety through a tree search
perspective. Unlike single-turn jailbreaks that rely on one meticulously
engineered prompt, Siege expands the conversation at each turn in a
breadth-first fashion, branching out multiple adversarial prompts that exploit
partial compliance from previous responses. By tracking these incremental
policy leaks and re-injecting them into subsequent queries, Siege reveals how
minor concessions can accumulate into fully disallowed outputs. Evaluations on
the JailbreakBench dataset show that Siege achieves a 100% success rate on
GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries
than baselines such as Crescendo or GOAT. This tree search methodology offers
an in-depth view of how model safeguards degrade over successive dialogue
turns, underscoring the urgency of robust multi-turn testing procedures for
language models.
comment: Accepted to ICLR 2025 Trustworthy LLM
☆ Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models ICLR 2025
Adapting large language models to multiple tasks can cause cross-skill
interference, where improvements for one skill degrade another. While methods
such as LoRA impose orthogonality constraints at the weight level, they do not
fully address interference in hidden-state representations. We propose
Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel
representation-based approach that learns multiple orthonormal subspace
transformations, each specializing in a distinct skill, and composes them via a
lightweight router. By isolating these subspace edits in the hidden state,
rather than weight matrices, CS-ReFT prevents cross-task conflicts more
effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B
achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring
only 0.0098% of model parameters. These findings show that specialized
representation edits, composed via a simple router, significantly enhance
multi-task instruction following with minimal overhead.
comment: Accepted to ICLR 2025 SCOPE
☆ Dual-Stage Cross-Modal Network with Dynamic Feature Fusion for Emotional Mimicry Intensity Estimation
Emotional Mimicry Intensity (EMI) estimation serves as a critical technology
for understanding human social behavior and enhancing human-computer
interaction experiences, where the core challenge lies in dynamic correlation
modeling and robust fusion of multimodal temporal signals. To address the
limitations of existing methods in insufficient exploitation of modal
synergistic effects, noise sensitivity, and limited fine-grained alignment
capabilities, this paper proposes a dual-stage cross-modal alignment framework.
First, we construct vision-text and audio-text contrastive learning networks
based on an improved CLIP architecture, achieving preliminary alignment in the
feature space through modality-decoupled pre-training. Subsequently, we design
a temporal-aware dynamic fusion module that combines Temporal Convolutional
Networks (TCN) and gated bidirectional LSTM to respectively capture the
macro-evolution patterns of facial expressions and local dynamics of acoustic
features. Innovatively, we introduce a quality-guided modality fusion strategy
that enables modality compensation under occlusion and noisy scenarios through
differentiable weight allocation. Experimental results on the Hume-Vidmimic2
dataset demonstrate that our method achieves an average Pearson correlation
coefficient of 0.35 across six emotion dimensions, outperforming the best
baseline by 40%. Ablation studies further validate the effectiveness of the
dual-stage training strategy and dynamic fusion mechanism, providing a novel
technical pathway for fine-grained emotion analysis in open environments.
☆ TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Object Hallucination (OH) has been acknowledged as one of the major
trustworthiness challenges in Large Vision-Language Models (LVLMs). Recent
advancements in Large Language Models (LLMs) indicate that internal states,
such as hidden states, encode the "overall truthfulness" of generated
responses. However, it remains under-explored how internal states in LVLMs
function and whether they could serve as "per-token" hallucination indicators,
which is essential for mitigating OH. In this paper, we first conduct an
in-depth exploration of LVLM internal states in relation to OH issues and
discover that (1) LVLM internal states are high-specificity per-token
indicators of hallucination behaviors. Moreover, (2) different LVLMs encode
universal patterns of hallucinations in common latent subspaces, indicating
that there exist "generic truthful directions" shared by various LVLMs. Based
on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt)
that first learns the truthful direction of LVLM decoding and then applies
truthful-guided inference-time intervention during LVLM decoding. We further
propose ComnHallu to enhance both cross-LVLM and cross-data hallucination
detection transferability by constructing and aligning hallucination latent
subspaces. We evaluate TruthPrInt in extensive experimental settings, including
in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks.
Experimental results indicate that TruthPrInt significantly outperforms
state-of-the-art methods. Codes will be available at
https://github.com/jinhaoduan/TruthPrInt.
comment: 15 pages, 9 figures, the first two authors contributed equally
☆ The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity
Despite classical statistical theory predicting severe overfitting, modern
massively overparameterized neural networks still generalize well. This
unexpected property is attributed to the network's so-called implicit bias,
which describes its propensity to converge to solutions that generalize
effectively, among the many possible that correctly label the training data.
The aim of our research is to explore this bias from a new perspective,
focusing on how non-linear activation functions contribute to shaping it.
First, we introduce a reparameterization which removes a continuous weight
rescaling symmetry. Second, in the kernel regime, we leverage this
reparameterization to generalize recent findings that relate shallow Neural
Networks to the Radon transform, deriving an explicit formula for the implicit
bias induced by a broad class of activation functions. Specifically, by
utilizing the connection between the Radon transform and the Fourier transform,
we interpret the kernel regime's inductive bias as minimizing a spectral
seminorm that penalizes high-frequency components, in a manner dependent on the
activation function. Finally, in the adaptive regime, we demonstrate the
existence of local dynamical attractors that facilitate the formation of
clusters of hyperplanes where the input to a neuron's activation function is
zero, yielding alignment between many neurons' response functions. We confirm
these theoretical results with simulations. Altogether, our work provides a
deeper understanding of the mechanisms underlying the generalization
capabilities of overparameterized neural networks and their relation to the
implicit bias, offering potential pathways for designing more efficient and
robust models.
comment: 18 pages, 10 figures in main text
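One familiar example of a continuous weight rescaling symmetry is sketched below for a shallow ReLU network. The choice of ReLU, the network sizes, and all names are assumptions made for illustration; the abstract covers a broad class of activations and does not specify the reparameterization itself.

```python
import numpy as np


def shallow_relu(x, W, b, a):
    """Shallow network f(x) = sum_k a_k * relu(w_k . x + b_k)."""
    return np.maximum(W @ x + b, 0.0) @ a


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))   # hidden weights (hypothetical sizes)
b = rng.normal(size=8)        # hidden biases
a = rng.normal(size=8)        # output weights
x = rng.normal(size=3)

# Positive per-neuron scale factors: since relu(c * z) = c * relu(z) for c > 0,
# scaling (w_k, b_k) by c_k and a_k by 1 / c_k leaves the function unchanged.
c = rng.uniform(0.5, 2.0, size=8)
f_original = shallow_relu(x, W, b, a)
f_rescaled = shallow_relu(x, c[:, None] * W, c * b, a / c)
```

Because a whole continuous family of parameter settings realizes the same function, removing this redundancy is a natural first step before analyzing which functions training prefers.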
☆ VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Vision-Language Models have made significant progress on many
perception-focused tasks. However, their progress on reasoning-focused tasks
seems to be limited by the lack of high-quality and diverse training data.
In this work, we aim to address the scarcity issue of reasoning-focused
multimodal datasets. We propose VisualWebInstruct, a novel approach that
leverages search engines to create a diverse, high-quality dataset spanning
multiple disciplines such as math, physics, finance, and chemistry. Starting with
30,000 meticulously selected seed images, we employ Google Image search to
identify websites containing similar images. We collect and process the HTMLs
from over 700K unique URL sources. Through a pipeline of content extraction,
filtering and synthesis, we build a dataset of approximately 900K
question-answer pairs, with 40% being visual QA pairs and the rest as text QA
pairs. Models fine-tuned on VisualWebInstruct demonstrate significant
performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point
gains across benchmarks, (2) training from MAmmoTH-VL shows a 5% absolute gain.
Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B
parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath
(55.7%). These remarkable results highlight the effectiveness of our dataset in
enhancing VLMs' reasoning capabilities for complex multimodal tasks.
comment: Technical Report
☆ KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
With the rapid advancement of large language models (LLMs) and
vision-language models (VLMs), significant progress has been made in developing
open-vocabulary robotic manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their applicability to
more complex, dynamic tasks. In this work, we introduce KUDA, an
open-vocabulary manipulation system that integrates dynamics learning and
visual prompting through keypoints, leveraging both VLMs and learning-based
neural dynamics models. Our key insight is that a keypoint-based target
specification is simultaneously interpretable by VLMs and can be efficiently
translated into cost functions for model-based planning. Given language
instructions and visual observations, KUDA first assigns keypoints to the RGB
image and queries the VLM to generate target specifications. These abstract
keypoint-based representations are then converted into cost functions, which
are optimized using a learned dynamics model to produce robotic trajectories.
We evaluate KUDA on a range of manipulation tasks, including free-form language
instructions across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness of our
framework. The project page is available at http://kuda-dynamics.github.io.
comment: Project website: http://kuda-dynamics.github.io
☆ Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More SC
This work concerns the path-star task, a minimal example of searching over a
graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start
node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$,
which ends one of the arms, and is tasked with generating the arm containing
$t$. The minimal nature of this task means only a single choice needs to be
made: which of the $D$ arms contains $t$?
Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to
a learned shortcut that absorbs training supervision. We show how this
pathology is caused by excess supervision and we present a series of solutions
demonstrating that the task is solvable via decoder-only LMs. We find that the
task's minimal nature causes its difficulty, as it prevents task decomposition.
Our solutions provide insight into the pathology and its implications for LMs
trained via next-token prediction.
comment: A reduced version of this work has been accepted to the Workshop on
Spurious Correlation and Shortcut Learning: Foundations and Solutions (SCSL)
at ICLR 2025. Full version under review
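A minimal sketch of how an instance of the path-star task could be generated is given below. The function name, the fixed arm length, the shuffled node labels, and the shuffled edge list are all illustrative assumptions; the abstract does not specify how graphs are sampled or serialized for the LM.

```python
import random


def make_path_star(num_arms, arm_length, seed=0):
    """Build a star-shaped graph with `num_arms` arms radiating from a start node.

    Returns (edges, start, target, answer), where `answer` is the arm
    (as a node list beginning at `start`) that ends in `target`.
    """
    rng = random.Random(seed)
    labels = list(range(1 + num_arms * arm_length))
    rng.shuffle(labels)  # random node labels so that ids carry no structure
    start = labels[0]
    arms = [
        [start] + labels[1 + i * arm_length: 1 + (i + 1) * arm_length]
        for i in range(num_arms)
    ]
    edges = [(arm[j], arm[j + 1]) for arm in arms for j in range(len(arm) - 1)]
    rng.shuffle(edges)  # present the edge list in random order
    target_arm = rng.randrange(num_arms)
    target = arms[target_arm][-1]  # the target node ends one of the arms
    return edges, start, target, arms[target_arm]


edges, s, t, answer = make_path_star(num_arms=5, arm_length=4)
```

Every node after the start has exactly one successor, so the only real decision is the first step out of `s`; guessing it uniformly gives the $1/D$ chance rate mentioned above.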
☆ GBSVR: Granular Ball Support Vector Regression
Support Vector Regression (SVR) and its variants are widely used to handle
regression tasks. However, their solution involves solving an expensive
quadratic programming problem, which limits their application, especially when
dealing with large datasets. Additionally, SVR uses an epsilon-insensitive loss
function which is sensitive to outliers and therefore can adversely affect its
performance. We propose Granular Ball Support Vector Regression (GBSVR) to
tackle the problem of regression by using the granular ball concept. These balls are
useful in simplifying complex data spaces for machine learning tasks; however,
to the best of our knowledge, they have not been sufficiently explored for
regression problems. Granular balls group the data points into balls based on
their proximity and reduce the computational cost in SVR by replacing the large
number of data points with far fewer granular balls. This work also suggests a
discretization method for continuous-valued attributes to facilitate the
construction of granular balls. The effectiveness of the proposed approach is
evaluated on several benchmark datasets, where it outperforms existing
state-of-the-art approaches.
☆ The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory
High-quality test items are essential for educational assessments,
particularly within Item Response Theory (IRT). Traditional validation methods
rely on resource-intensive pilot testing to estimate item difficulty and
discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a
domain-general approach for evaluating test items based on textual features.
However, their relationship to IRT parameters remains underexplored. To address
this gap, we conducted a study involving over 7,000 multiple-choice questions
across various STEM subjects (e.g., math and biology). Using an automated
approach, we annotated each question with a 19-criteria IWF rubric and studied
relationships to data-driven IRT parameters. Our analysis revealed
statistically significant links between the number of IWFs and IRT difficulty
and discrimination parameters, particularly in life and physical science
domains. We further observed how specific IWF criteria can impact item quality
more or less severely (e.g., negative wording vs. implausible distractors).
Overall, while IWFs are useful for predicting IRT parameters--particularly for
screening low-difficulty MCQs--they cannot replace traditional data-driven
validation methods. Our findings highlight the need for further research on
domain-general evaluation rubrics and algorithms that understand
domain-specific content for robust item validation.
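To make the two IRT parameters concrete, the sketch below uses the two-parameter logistic (2PL) model, in which difficulty shifts the response curve and discrimination controls its steepness. The abstract does not name the specific IRT model used, so the 2PL form and the function name are assumptions.

```python
import math


def p_correct(theta, difficulty, discrimination):
    """2PL item response function: probability that a test taker with
    ability `theta` answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# When ability equals difficulty, the probability is exactly 0.5.
at_threshold = p_correct(theta=0.0, difficulty=0.0, discrimination=1.0)

# A more discriminating item separates test takers above and below the
# difficulty level more sharply.
strong_high = p_correct(theta=1.0, difficulty=0.0, discrimination=2.0)
strong_low = p_correct(theta=-1.0, difficulty=0.0, discrimination=2.0)
weak_high = p_correct(theta=1.0, difficulty=0.0, discrimination=0.5)
weak_low = p_correct(theta=-1.0, difficulty=0.0, discrimination=0.5)
```

Under this model, a flawed item would be expected to show up as an unusual difficulty value or as low discrimination, which is the kind of relationship the study examines.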
☆ Lightweight Models for Emotional Analysis in Video
In this study, we present an approach for efficient spatiotemporal feature
extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal
aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB)
blocks, serves as the backbone for extracting hierarchical feature
representations from input image sequences, ensuring both computational
efficiency and rich semantic encoding. To capture temporal dependencies, we
introduce a three-level MLP-Mixer module, which processes spatial features at
multiple resolutions while maintaining structural integrity. Experimental
results on the ABAW 8th competition demonstrate the effectiveness of our
approach, showing promising performance in affective behavior analysis. By
integrating an efficient vision backbone with a structured temporal modeling
mechanism, the proposed framework achieves a balance between computational
efficiency and predictive accuracy, making it well-suited for real-time
applications in mobile and embedded computing environments.
☆ PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models
Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li
3D Multimodal Large Language Models (MLLMs) have recently made substantial
advancements. However, their potential remains untapped, primarily due to the
limited quantity and suboptimal quality of 3D datasets. Current approaches
attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but
still face modality and domain gaps. To this end, we introduce PiSA-Engine
(Point-Self-Augmented-Engine), a new framework for generating instruction
point-language datasets enriched with 3D spatial semantics. We observe that
existing 3D MLLMs offer a comprehensive understanding of point clouds for
annotation, while 2D MLLMs excel at cross-validation by providing complementary
information. By integrating holistic 2D and 3D insights from off-the-shelf
MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation.
We select PointLLM as the baseline and adopt this co-evolution training
framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally,
we identify limitations in previous 3D benchmarks, which often feature coarse
language captions and insufficient category diversity, resulting in inaccurate
evaluations. To address this gap, we further introduce PiSA-Bench, a
comprehensive 3D benchmark covering six key aspects with detailed and diverse
labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art
performance in zero-shot 3D object captioning and generative classification on
our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and
63.75% (+16.25%), respectively. We will release the code, datasets, and
benchmark.
comment: Technical Report
☆ CountPath: Automating Fragment Counting in Digital Pathology
Ana Beatriz Vieira, Maria Valente, Diana Montezuma, Tomé Albuquerque, Liliana Ribeiro, Domingos Oliveira, João Monteiro, Sofia Gonçalves, Isabel M. Pinto, Jaime S. Cardoso, Arlindo L. Oliveira
Quality control of medical images is a critical component of digital
pathology, ensuring that diagnostic images meet required standards. A
pre-analytical task within this process is the verification of the number of
specimen fragments, a process that ensures that the number of fragments on a
slide matches the number documented in the macroscopic report. This step is
important to ensure that the slides contain the appropriate diagnostic material
from the grossing process, thereby guaranteeing the accuracy of subsequent
microscopic examination and diagnosis. Traditionally, this assessment is
performed manually, requiring significant time and effort while being subject
to significant variability due to its subjective nature. To address these
challenges, this study explores an automated approach to fragment counting
using the YOLOv9 and Vision Transformer models. Our results demonstrate that
the automated system achieves a level of performance comparable to expert
assessments, offering a reliable and efficient alternative to manual counting.
Additionally, we present findings on interobserver variability, showing that
the automated approach achieves an accuracy of 86%, which falls within the
range of variation observed among experts (82-88%), further supporting its
potential for integration into routine pathology workflows.
comment: 10 pages, 3 figures
☆ Why the Brain Cannot Be a Digital Computer: History-Dependence and the Computational Limits of Consciousness
This paper presents a novel information-theoretic proof demonstrating that
the human brain as currently understood cannot function as a classical digital
computer. Through systematic quantification of distinguishable conscious states
and their historical dependencies, we establish that the minimum information
required to specify a conscious state exceeds the physical information capacity
of the human brain by a significant factor. Our analysis calculates the
bit-length requirements for representing consciously distinguishable sensory
"stimulus frames" and demonstrates that consciousness exhibits mandatory
temporal-historical dependencies that multiply these requirements beyond the
brain's storage capabilities. This mathematical approach offers new insights
into the fundamental limitations of computational models of consciousness and
suggests that non-classical information processing mechanisms may be necessary
to account for conscious experience.
comment: 10 pages, 1 figure
☆ Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression
We consider the problem of generating valid and small prediction sets by
sampling outputs (e.g., software code and natural language text) from a
black-box deep generative model for a given input (e.g., textual prompt). The
validity of a prediction set is determined by a user-defined binary
admissibility function depending on the target application. For example, in a
code generation application, it may require at least one program in the set to
pass all test cases. To address this problem, we develop a simple and
effective conformal inference algorithm referred to as Generative Prediction
Sets (GPS). Given a set of calibration examples and black-box access to a deep
generative model, GPS can generate prediction sets with provable guarantees.
The key insight behind GPS is to exploit the inherent structure within the
distribution over the minimum number of samples needed to obtain an admissible
output to develop a simple conformal regression approach over the minimum
number of samples. Experiments on multiple datasets for code and math word
problems using different large language models demonstrate the efficacy of GPS
over state-of-the-art methods.
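As a rough illustration of the idea of conformalizing the minimum number of samples, the sketch below computes a single, input-independent sample budget from calibration data using the standard split-conformal quantile. It is a simplified stand-in, not the GPS algorithm itself (which the abstract describes as a conformal regression approach); the function name, the `max_budget` cap, and the convention of recording `max_budget + 1` for calibration inputs that never produced an admissible output are all assumptions made for this example.

```python
import math


def conformal_sample_budget(calib_min_samples, alpha, max_budget):
    """Return a budget K with P(min samples needed <= K) >= 1 - alpha.

    calib_min_samples: for each calibration input, the smallest number of
    draws after which an admissible output appeared. By the (hypothetical)
    convention used here, max_budget + 1 means none appeared within budget.
    Returns None when the guarantee cannot be certified.
    """
    n = len(calib_min_samples)
    scores = sorted(calib_min_samples)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return None  # too few calibration points for this alpha
    k = scores[rank - 1]
    # A quantile beyond the cap means the budget is too small to certify.
    return k if k <= max_budget else None


calib = [1, 1, 2, 2, 3, 3, 4, 5, 8, 11]  # 11 marks "not admissible within 10"
print(conformal_sample_budget(calib, alpha=0.2, max_budget=10))
```

The guarantee here is marginal and relies on the calibration and test inputs being exchangeable; conditioning the budget on the input, as a regression-based approach would, is what allows smaller sets on easy inputs.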
☆ Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks
Modeling natural phenomena with artificial neural networks (ANNs) often
provides highly accurate predictions. However, ANNs often suffer from
over-parameterization, complicating interpretation and raising uncertainty
issues. Bayesian neural networks (BNNs) address the latter by representing
weights as probability distributions, allowing for predictive uncertainty
evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle
structural uncertainty and sparsify models by removing redundant weights. This
article advances LBBNNs by enabling covariates to skip to any succeeding layer
or be excluded, simplifying networks and clarifying input impacts on
predictions. Ultimately, a linear model or even a constant can be found to be
optimal for a specific problem at hand. Furthermore, the input-skip LBBNN
approach reduces network density significantly compared to standard LBBNNs,
achieving over 99% reduction for small networks and over 99.9% for larger ones,
while still maintaining high predictive accuracy and uncertainty measurement.
For example, on MNIST, we reached 97% accuracy and great calibration with just
935 weights, reaching state-of-the-art for compression of neural networks.
Furthermore, the proposed method accurately identifies the true covariates and
adjusts for system non-linearity. The main contribution is the introduction of
active paths, enhancing directly designed global and local explanations within
the LBBNN framework, that have theoretical guarantees and do not require post
hoc external tools for explanations.
comment: 44 pages, 19 tables, 25 figures. Code available at
https://github.com/eirihoyh/ISLaB-LBBNN
☆ LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions
Large Language Models (LLMs) are revolutionizing medical diagnostics by
enhancing both disease classification and clinical decision-making. In this
study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek
R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We
assessed their predictive accuracy at both the disease and category levels, as
well as the reliability of their confidence scores. DeepSeek R1 achieved a
disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3
Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1
demonstrated exceptional performance in Mental Health, Neurological Disorders,
and Oncology, where it reached 100% accuracy, while O3 Mini excelled in
Autoimmune Disease classification with 100% accuracy. Both models, however,
struggled with Respiratory Disease classification, recording accuracies of only
40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of
confidence scores revealed that DeepSeek R1 provided high-confidence
predictions in 92% of cases, compared to 68% for O3 Mini. Ethical
considerations regarding bias, model interpretability, and data privacy are
also discussed to ensure the responsible integration of LLMs into clinical
practice. Overall, our findings offer valuable insights into the strengths and
limitations of LLM-based diagnostic systems and provide a roadmap for future
enhancements in AI-driven healthcare.
comment: 12 pages, 3 figures
☆ DeclareAligner: A Leap Towards Efficient Optimal Alignments for Declarative Process Model Conformance Checking
In many engineering applications, processes must be followed precisely,
making conformance checking between event logs and declarative process models
crucial for ensuring adherence to desired behaviors. This is a critical area
where Artificial Intelligence (AI) plays a pivotal role in driving effective
process improvement. However, computing optimal alignments poses significant
computational challenges due to the vast search space inherent in these models.
Consequently, existing approaches often struggle with scalability and
efficiency, limiting their applicability in real-world settings. This paper
introduces DeclareAligner, a novel algorithm that uses the A* search algorithm,
an established AI pathfinding technique, to tackle the problem from a fresh
perspective leveraging the flexibility of declarative models. Key features of
DeclareAligner include only performing actions that actively contribute to
fixing constraint violations, utilizing a tailored heuristic to navigate
towards optimal solutions, and employing early pruning to eliminate
unproductive branches, while also streamlining the process through
preprocessing and consolidating multiple fixes into unified actions. The
proposed method is evaluated using 8,054 synthetic and real-life alignment
problems, demonstrating its ability to efficiently compute optimal alignments
by significantly outperforming the current state of the art. By enabling
process analysts to more effectively identify and understand conformance
issues, DeclareAligner has the potential to drive meaningful process
improvement and management.
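For readers unfamiliar with the search technique named above, here is a minimal, generic A* sketch. It is not DeclareAligner: the state space, the `neighbors` callback, and the `heuristic` callback are placeholders invented for this example, and none of the paper's alignment-specific actions, pruning, or preprocessing is modeled. The search returns an optimal cost provided the heuristic never overestimates the remaining cost.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return the cheapest path cost from start to goal, or None.

    neighbors(node) yields (next_node, step_cost) pairs; heuristic(node)
    must not overestimate the remaining cost for the result to be optimal.
    """
    frontier = [(heuristic(start), 0, start)]  # (estimated total, cost so far, node)
    best_cost = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Toy graph; the zero heuristic reduces A* to Dijkstra's algorithm.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
print(a_star("A", "D", lambda n: graph[n], lambda n: 0))
```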
☆ Siamese Foundation Models for Crystal Structure Prediction
Liming Wu, Wenbing Huang, Rui Jiao, Jianxing Huang, Liwei Liu, Yipeng Zhou, Hao Sun, Yang Liu, Fuchun Sun, Yuxiang Ren, Jirong Wen
Crystal Structure Prediction (CSP), which aims to generate stable crystal
structures from compositions, represents a critical pathway for discovering
novel materials. While structure prediction tasks in other domains, such as
proteins, have seen remarkable progress, CSP remains a relatively underexplored
area due to the more complex geometries inherent in crystal structures. In this
paper, we propose Siamese foundation models specifically designed to address
CSP. Our pretrain-finetune framework, named DAO, comprises two complementary
foundation models: DAO-G for structure generation and DAO-P for energy
prediction. Experiments on CSP benchmarks (MP-20 and MPTS-52) demonstrate that
our DAO-G significantly surpasses state-of-the-art (SOTA) methods across all
metrics. Extensive ablation studies further confirm that DAO-G excels in
generating diverse polymorphic structures, and the dataset relaxation and
energy guidance provided by DAO-P are essential for enhancing DAO-G's
performance. When applied to three real-world superconductors
($\text{CsV}_3\text{Sb}_5$, $\text{Zr}_{16}\text{Rh}_8\text{O}_4$ and
$\text{Zr}_{16}\text{Pd}_8\text{O}_4$) that are known to be challenging to
analyze, our foundation models achieve accurate critical temperature
predictions and structure generations. For instance, on
$\text{CsV}_3\text{Sb}_5$, DAO-G generates a structure close to the
experimental one with an RMSE of 0.0085; DAO-P predicts the $T_c$ value with
high accuracy (2.26 K vs. the ground-truth value of 2.30 K). In contrast,
conventional DFT calculators like Quantum Espresso only successfully derive the
structure of the first superconductor within an acceptable time, while the RMSE
is nearly 8 times larger, and the computation speed is more than 1000 times
slower. These compelling results collectively highlight the potential of our
approach for advancing materials science research and development.
☆ DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
The rapid advancement of large language models (LLMs) has significantly
improved their performance in code generation tasks. However, existing code
benchmarks remain static, consisting of fixed datasets with predefined
problems. This makes them vulnerable to memorization during training, where
LLMs recall specific test cases instead of generalizing to new problems,
leading to data contamination and unreliable evaluation results. To address
these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that
overcomes the limitations of static datasets. DynaCode evaluates LLMs
systematically using a complexity-aware metric, incorporating both code
complexity and call-graph structures. DynaCode achieves large-scale diversity,
generating up to 189 million unique nested code problems across four distinct
levels of code complexity, referred to as units, and 16 types of call graphs.
Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7%
compared to MBPP+, a static code generation benchmark, with performance
progressively decreasing as complexity increases. This demonstrates DynaCode's
ability to effectively differentiate LLMs. Additionally, by leveraging call
graphs, we gain insights into LLM behavior, particularly their preference for
handling subfunction interactions within nested code.
comment: 16 pages, 11 figures
☆ Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings
Speaker identification in multilingual settings presents unique challenges,
particularly when conventional models are predominantly trained on English
data. In this paper, we propose WSI (Whisper Speaker Identification), a
framework that repurposes the encoder of the Whisper automatic speech
recognition model, pre-trained on extensive multilingual data, to generate robust
speaker embeddings via a joint loss optimization strategy that leverages online
hard triplet mining and self-supervised Normalized Temperature-scaled Cross
Entropy loss. By capitalizing on Whisper's language-agnostic acoustic
representations, our approach effectively distinguishes speakers across diverse
languages and recording conditions. Extensive evaluations on multiple corpora,
including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish,
Chinese, and Japanese), and Voxconverse (English), demonstrate that WSI
consistently outperforms state-of-the-art baselines, namely Pyannote Embedding,
ECAPA TDNN, and Xvector, in terms of lower equal error rates and higher AUC
scores. These results validate our hypothesis that a multilingual pre-trained
ASR encoder, combined with joint loss optimization, substantially improves
speaker identification performance in non-English languages.
comment: 6 pages
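The Normalized Temperature-scaled Cross Entropy (NT-Xent) loss mentioned in the WSI abstract can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: rows `2i` and `2i+1` are treated as a positive pair, the temperature of 0.5 is arbitrary, and the abstract does not say how WSI forms its pairs or combines this term with the triplet loss.

```python
import math


def nt_xent(embeddings, temperature=0.5):
    """Mean NT-Xent loss; rows 2i and 2i+1 are assumed to be a positive pair."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [normalize(v) for v in embeddings]
    total = 0.0
    for i in range(len(z)):
        j = i ^ 1  # index of the positive partner (0<->1, 2<->3, ...)
        # Scaled cosine similarities to every other embedding in the batch.
        sims = [dot(z[i], z[k]) / temperature for k in range(len(z)) if k != i]
        pos = dot(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp over the candidates.
        m = max(sims)
        lse = m + math.log(sum(math.exp(s - m) for s in sims))
        total += lse - pos
    return total / len(z)


aligned = [[1, 0], [1, 0], [0, 1], [0, 1]]  # partners point the same way
mixed = [[1, 0], [0, 1], [1, 0], [0, 1]]    # partners are orthogonal
print(nt_xent(aligned) < nt_xent(mixed))
```

The loss is low when each embedding is closer to its partner than to the rest of the batch, which is the property a speaker-embedding objective needs.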
☆ dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis
Luyuan Xie, Tianyu Luan, Wenyuan Cai, Guochen Yan, Zhaoyu Chen, Nan Xi, Yuejian Fang, Qingni Shen, Zhonghai Wu, Junsong Yuan
Federated learning has wide applications in the medical field. It enables
knowledge sharing among different healthcare institutes while protecting
patients' privacy. However, existing federated learning systems are typically
centralized, requiring clients to upload client-specific knowledge to a central
server for aggregation. This centralized approach integrates the knowledge
from each client on a central server, so the knowledge is already degraded
during centralized integration before it is returned to each
client. In addition, the centralized approach creates a dependency on the
central server, which may affect training stability if the server malfunctions
or connections are unstable. To address these issues, we propose a
decentralized federated learning framework named dFLMoE. In our framework,
clients directly exchange lightweight head models with each other. After
exchanging, each client treats both local and received head models as
individual experts, and utilizes a client-specific Mixture of Experts (MoE)
approach to make collective decisions. This design not only reduces the
knowledge damage with client-specific aggregations but also removes the
dependency on the central server to enhance the robustness of the framework. We
validate our framework on multiple medical tasks, demonstrating that our method
clearly outperforms state-of-the-art approaches under both model homogeneity
and heterogeneity settings.
☆ RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Unifying diverse image generation tasks within a single framework remains a
fundamental challenge in visual generation. While large language models (LLMs)
achieve unification through task-agnostic data and generation, existing visual
generation models fail to meet these principles. Current approaches either rely
on per-task datasets and large-scale training or adapt pre-trained image models
with task-specific modifications, limiting their generalizability. In this
work, we explore video models as a foundation for unified image generation,
leveraging their inherent ability to model temporal correlations. We introduce
RealGeneral, a novel framework that reformulates image generation as a
conditional frame prediction task, analogous to in-context learning in LLMs. To
bridge the gap between video models and condition-image pairs, we propose (1) a
Unified Conditional Embedding module for multi-modal alignment and (2) a
Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask
to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in
multiple important visual generation tasks, e.g., it achieves a 14.5%
improvement in subject similarity for customized generation and a 10%
enhancement in image quality for the canny-to-image task. Project page:
https://lyne1.github.io/RealGeneral/
☆ RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang
Recent advances in self-supervised learning for Vision Transformers (ViTs)
have fueled breakthroughs in remote sensing (RS) foundation models. However,
the quadratic complexity of self-attention poses a significant barrier to
scalability, particularly for large models and high-resolution images. While
the linear-complexity Mamba architecture offers a promising alternative,
existing RS applications of Mamba remain limited to supervised tasks on small,
domain-specific datasets. To address these challenges, we propose RoMA, a
framework that enables scalable self-supervised pretraining of Mamba-based RS
foundation models using large-scale, diverse, unlabeled data. RoMA enhances
scalability for high-resolution images through a tailored auto-regressive
learning strategy, incorporating two key innovations: 1) a rotation-aware
pretraining mechanism combining adaptive cropping with angular embeddings to
handle sparsely distributed objects with arbitrary orientations, and 2)
multi-scale token prediction objectives that address the extreme variations in
object scales inherent to RS imagery. Systematic empirical studies validate
that Mamba adheres to RS data and parameter scaling laws, with performance
scaling reliably as model and data size increase. Furthermore, experiments
across scene classification, object detection, and semantic segmentation tasks
demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based
counterparts in both accuracy and computational efficiency. The source code and
pretrained models will be released at https://github.com/MiliLab/RoMA.
☆ CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
Video generation has witnessed remarkable progress with the advent of deep
generative models, particularly diffusion models. While existing methods excel
in generating high-quality videos from text prompts or single images,
personalized multi-subject video generation remains a largely unexplored
challenge. This task involves synthesizing videos that incorporate multiple
distinct subjects, each defined by separate reference images, while ensuring
temporal and spatial consistency. Current approaches primarily rely on mapping
subject images to keywords in text prompts, which introduces ambiguity and
limits their ability to model subject relationships effectively. In this paper,
we propose CINEMA, a novel framework for coherent multi-subject video
generation by leveraging a Multimodal Large Language Model (MLLM). Our approach
eliminates the need for explicit correspondences between subject images and
text entities, mitigating ambiguity and reducing annotation effort. By
leveraging the MLLM to interpret subject relationships, our method facilitates
scalability, enabling the use of large and diverse datasets for training.
Furthermore, our framework can be conditioned on varying numbers of subjects,
offering greater flexibility in personalized content creation. Through
extensive evaluations, we demonstrate that our approach significantly improves
subject consistency and overall video coherence, paving the way for advanced
applications in storytelling, interactive media, and personalized video
generation.
☆ A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection PAKDD 2025
Algorithmic detection of facial palsy offers the potential to improve current
practices, which usually involve labor-intensive and subjective assessments by
clinicians. In this paper, we present a multimodal fusion-based deep learning
model that utilizes an MLP mixer-based model to process unstructured data (i.e.
RGB images or images with facial line segments) and a feed-forward neural
network to process structured data (i.e. facial landmark coordinates, features
of facial expressions, or handcrafted features) for detecting facial palsy. We
then contribute a study that analyzes the effect of different data modalities
and the benefits of a multimodal fusion-based approach using videos of 20
facial palsy patients and 20 healthy subjects. Our multimodal fusion model
achieved 96.00 F1, which is significantly higher than the feed-forward neural
network trained on handcrafted features alone (82.80 F1) and an MLP mixer-based
model trained on raw RGB images (89.00 F1).
comment: PAKDD 2025. arXiv admin note: text overlap with arXiv:2405.16496
☆ G-Boost: Boosting Private SLMs with General LLMs
Due to limited computational resources, most Large Language Model (LLM)
developers can only fine-tune Small Language Models (SLMs) on their own data.
These private SLMs typically have limited effectiveness. To boost the
performance of private SLMs, this paper proposes to ask general LLMs for help.
The general LLMs can be APIs or larger LLMs whose inference cost the developers
can afford. Specifically, we propose the G-Boost framework where a private SLM
adaptively performs collaborative inference with a general LLM under the guidance
of a process reward. Experiments demonstrate that our framework can significantly
boost the performance of private SLMs.
☆ Object detection characteristics in a learning factory environment using YOLOv8
AI-based object detection, and efforts to explain and investigate its
characteristics, are topics of high interest. The impact of, e.g., complex
background structures with appearances similar to the objects of interest on
the detection accuracy and, beforehand, the necessary dataset composition are
topics of ongoing research. In this paper, we present a systematic
investigation of background influences and different features of the object to
be detected. The latter includes various materials and surfaces, partially
transparent and with shiny reflections in the context of an Industry 4.0
learning factory. Different YOLOv8 models have been trained for each of the
materials on differently sized datasets, where the appearance was the only
changing parameter. In the end, similar characteristics tend to show different
behaviours and sometimes unexpected results. While some background components
tend to be detected, others with the same features are not part of the
detection. Additionally, some more precise conclusions can be drawn from the
results. Therefore, we contribute a challenging dataset with detailed
investigations on 92 trained YOLO models, addressing some issues on the
detection accuracy and possible overfitting.
☆ KV-Distill: Nearly Lossless Learnable Context Compression for LLMs
Sequence-to-sequence tasks often benefit from long contexts, but the
quadratic complexity of self-attention in standard Transformers renders this
non-trivial. During generation, temporary representations (stored in the
so-called KV cache) account for a large portion of GPU memory usage and scale
linearly with context length. We introduce KV-Distill, a Transformer
compression framework that distills long context KV caches into significantly
shorter representations in a question-independent fashion. KV-Distill can be
trained as a parameter-efficient adaptor for pretrained models, and enables the
compression of arbitrary spans of a context while preserving pre-trained model
capabilities. We treat a compressed-uncompressed cache as a student-teacher
pairing and apply a KL-type divergence to match the generated outputs.
KV-Distill outperforms other compression techniques in worst-case extractive
tasks and approaches uncompressed performance in long context question
answering and summarization, and it can be fine-tuned on domain-specific
contexts to reduce lengths by up to 99% while preserving downstream
performance. We demonstrate the generalizability of KV-Distill across various
model sizes and architectures.
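The student-teacher matching described above can be illustrated with a small sketch that compares next-token distributions from two sets of logits. The abstract only says "a KL-type divergence", so the forward KL direction used here, the function names, and the toy logits are assumptions for illustration rather than the paper's actual objective.

```python
import math


def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """Forward KL(teacher || student) over one next-token distribution."""
    p = softmax(teacher_logits)  # teacher: model reading the uncompressed cache
    q = softmax(student_logits)  # student: model reading the compressed cache
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
print(kl_divergence(teacher, student))
```

The divergence is zero when the compressed cache reproduces the teacher's distribution exactly and grows as the two distributions drift apart, which makes it usable as a training signal.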
☆ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, informing future research directions for developing resilient
and adaptable robotic systems. Our code is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
☆ Enhance Exploration in Safe Reinforcement Learning with Contrastive Representation Learning
In safe reinforcement learning, an agent needs to balance between exploration
actions and safety constraints. Following this paradigm, domain transfer
approaches learn a prior Q-function from the related environments to prevent
unsafe actions. However, because of the large number of false positives, some
safe actions are never executed, leading to inadequate exploration in
sparse-reward environments. In this work, we aim to learn an efficient state
representation to balance exploration and safety-preferring actions in a
sparse-reward environment. First, the image input is mapped to a latent
representation by an auto-encoder. A further contrastive learning objective is
employed to distinguish safe and unsafe states. In the learning phase, the
latent distance is used to construct an additional safety check, which allows
the agent to bias the exploration if it visits an unsafe state. To verify the
effectiveness of our method, the experiment is carried out in three
navigation-based MiniGrid environments. The result highlights that our method
can explore the environment better while maintaining a good balance between
safety and efficiency.
comment: Accepted at ACIIDS 2025
☆ Nash Equilibrium Constrained Auto-bidding With Bi-level Reinforcement Learning
Many online advertising platforms provide advertisers with auto-bidding
services to enhance their advertising performance. However, most existing
auto-bidding algorithms fail to accurately capture the auto-bidding problem
formulation that the platform truly faces, let alone solve it. Actually, we
argue that the platform should try to help optimize each advertiser's
performance to the greatest extent -- which makes $\epsilon$-Nash Equilibrium
($\epsilon$-NE) a necessary solution concept -- while maximizing the social
welfare of all the advertisers for the platform's long-term value. Based on
this, we introduce the \emph{Nash-Equilibrium Constrained Bidding} (NCB), a new
formulation of the auto-bidding problem from the platform's perspective.
Specifically, it aims to maximize the social welfare of all advertisers under
the $\epsilon$-NE constraint. However, the NCB problem presents significant
challenges due to its constrained bi-level structure and the typically large
number of advertisers involved. To address these challenges, we propose a
\emph{Bi-level Policy Gradient} (BPG) framework with theoretical guarantees.
Notably, its computational complexity is independent of the number of
advertisers, and the associated gradients are straightforward to compute.
Extensive simulated and real-world experiments validate the effectiveness of
the BPG framework.
☆ Bilingual Dual-Head Deep Model for Parkinson's Disease Detection from Speech ICASSP 2025
This work aims to tackle the Parkinson's disease (PD) detection problem from
the speech signal in a bilingual setting by proposing an ad-hoc dual-head deep
neural architecture for type-based binary classification. One head is
specialized for diadochokinetic patterns. The other head looks for natural
speech patterns present in continuous spoken utterances. Only one of the two
heads is operative according to the nature of the input. Speech
representations are extracted from self-supervised learning (SSL) models and
wavelet transforms. Adaptive layers, convolutional bottlenecks, and contrastive
learning are exploited to reduce variations across languages. Our solution is
assessed against two distinct datasets, EWA-DB and PC-GITA, which cover Slovak
and Spanish languages, respectively. Results indicate that conventional models
trained on a single language dataset struggle with cross-linguistic
generalization, and naive combinations of datasets are suboptimal. In contrast,
our model improves generalization on both languages simultaneously.
comment: Accepted at ICASSP 2025 - Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses
☆ CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles
This paper discusses the integration challenges and strategies for designing
mobile robots, by focusing on the task-driven, optimal selection of hardware
and software to balance safety, efficiency, and minimal usage of resources such
as costs, energy, computational requirements, and weight. We emphasize the
interplay between perception and motion planning in decision-making by
introducing the concept of occupancy queries to quantify the perception
requirements for sampling-based motion planners. Sensor and algorithm
performance are evaluated using False Negative Rates (FNR) and False Positive
Rates (FPR) across various factors such as geometric relationships, object
properties, sensor resolution, and environmental conditions. By integrating
perception requirements with perception performance, an Integer Linear
Programming (ILP) approach is proposed for efficient sensor and algorithm
selection and placement. This forms the basis for a co-design optimization that
includes the robot body, motion planner, perception pipeline, and computing
unit. We refer to this framework for solving the co-design problem of mobile
robots as CODEI, short for Co-design of Embodied Intelligence. A case study on
developing an Autonomous Vehicle (AV) for urban scenarios provides actionable
information for designers, and shows that complex tasks escalate resource
demands, with task performance affecting choices of the autonomy stack. The
study demonstrates that resource prioritization influences sensor choice:
cameras are preferred for cost-effective and lightweight designs, while lidar
sensors are chosen for better energy and computational efficiency.
comment: 20 pages, 33 images, IEEE Transactions on Robotics
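The two error rates used in the CODEI abstract to evaluate sensors and algorithms follow the standard confusion-matrix definitions, sketched below. The function name and the example counts are hypothetical; the abstract does not specify how detections are matched to ground truth.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false negative rate, false positive rate) from confusion counts.

    FNR = FN / (FN + TP): share of real objects that were missed.
    FPR = FP / (FP + TN): share of empty cases flagged as occupied.
    A rate is reported as 0.0 when its denominator is zero.
    """
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Hypothetical counts for one sensor and algorithm pairing.
fnr, fpr = error_rates(tp=90, fp=5, tn=95, fn=10)
print(fnr, fpr)
```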
☆ PyGDA: A Python Library for Graph Domain Adaptation
Graph domain adaptation has emerged as a promising approach to facilitate
knowledge transfer across different domains. Recently, numerous models have
been proposed to enhance their generalization capabilities in this field.
However, there is still no unified library that brings together existing
techniques and simplifies their implementation. To fill this gap, we introduce
PyGDA, an open-source Python library tailored for graph domain adaptation. As
the first comprehensive library in this area, PyGDA covers more than 20 widely
used graph domain adaptation methods together with different types of graph
datasets. Specifically, PyGDA offers modular components, enabling users to
seamlessly build custom models with a variety of commonly used utility
functions. To handle large-scale graphs, PyGDA includes support for features
such as sampling and mini-batch processing, ensuring efficient computation. In
addition, PyGDA includes comprehensive performance benchmarks and a
well-documented, user-friendly API for both researchers and practitioners. To
foster convenient accessibility, PyGDA is released under the MIT license at
https://github.com/pygda-team/pygda, and the API documentation is available at
https://pygda.readthedocs.io/en/stable/.
comment: Under Review
☆ SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence
Integration of Vision-Language Models (VLMs) in surgical intelligence is
hindered by hallucinations, domain knowledge gaps, and limited understanding of
task interdependencies within surgical scenes, undermining clinical
reliability. While recent VLMs demonstrate strong general reasoning and
thinking capabilities, they still lack the domain expertise and task-awareness
required for precise surgical scene interpretation. Although Chain-of-Thought
(CoT) can structure reasoning more effectively, current approaches rely on
self-generated CoT steps, which often exacerbate inherent domain gaps and
hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent
framework that delivers transparent, interpretable insights for most tasks in
robotic-assisted surgery. By employing specialized CoT prompts across five
tasks: instrument recognition, action recognition, action prediction, patient
data extraction, and outcome assessment, SurgRAW mitigates hallucinations
through structured, domain-aware reasoning. Retrieval-Augmented Generation
(RAG) is also integrated with external medical knowledge to bridge domain gaps
and improve response reliability. Most importantly, a hierarchical agentic
system ensures that CoT-embedded VLM agents collaborate effectively while
understanding task interdependencies, and a panel discussion mechanism
promotes logical consistency. To evaluate our method, we introduce
SurgCoTBench, the first reasoning-based dataset with structured frame-level
annotations. With comprehensive experiments, we demonstrate the effectiveness
of the proposed SurgRAW, with a 29.32% accuracy improvement over baseline VLMs
on 12 robotic procedures, achieving state-of-the-art performance and advancing
explainable, trustworthy, and autonomous surgical assistance.
☆ PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction
Simulation of spatiotemporal systems governed by partial differential
equations is widely applied in fields such as biology, chemistry, aerospace
dynamics, and meteorology. Traditional numerical methods incur high
computational costs due to the requirement of small time steps for accurate
predictions. While machine learning has reduced these costs, long-term
predictions remain challenged by error accumulation, particularly in scenarios
with insufficient data or varying time scales, where stability and accuracy are
compromised. Existing methods often neglect the effective utilization of
multi-scale data, leading to suboptimal robustness in predictions. To address
these issues, we propose a novel multi-scale learning framework, namely, the
Physics-Informed Multi-Scale Recurrent Learning (PIMRL), to effectively
leverage multi-scale data for spatiotemporal dynamics prediction. The PIMRL
framework comprises two modules: the micro-scale module embeds physical
knowledge into neural networks via pretraining, and the macro-scale module
adopts a data-driven approach to learn the temporal evolution of physics in the
latent space. Experimental results demonstrate that the PIMRL framework
consistently achieves state-of-the-art performance across five benchmark
datasets ranging from one to three dimensions, showing average improvements of
over 9% in both RMSE and MAE evaluation metrics, with maximum enhancements
reaching up to 80%.
☆ LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns
We investigate the choice patterns of Large Language Models (LLMs) in the
context of Decisions from Experience tasks that involve repeated choice and
learning from feedback, and compare their behavior to human participants. We
find that in the aggregate, LLMs appear to display behavioral biases similar to
humans: both exhibit underweighting of rare events and correlation effects.
However, more nuanced analyses of the choice patterns reveal that this happens
for very different reasons. LLMs exhibit strong recency biases, unlike humans,
who appear to respond in more sophisticated ways. While these different
processes may lead to similar behavior on average, choice patterns contingent
on recent events differ vastly between the two groups. Specifically, phenomena
such as "surprise triggers change" and the "wavy recency effect of rare
events" are robustly observed in humans, but entirely absent in LLMs. Our
findings provide insights into the limitations of using LLMs to simulate and
predict humans in learning environments and highlight the need for refined
analyses of their behavior when investigating whether they replicate human
decision making tendencies.
☆ MinorBench: A hand-built benchmark for content-based risks for children
Large Language Models (LLMs) are rapidly entering children's lives - through
parent-driven adoption, schools, and peer networks - yet current AI ethics and
safety research does not adequately address content-related risks specific to
minors. In this paper, we highlight these gaps with a real-world case study of
an LLM-based chatbot deployed in a middle school setting, revealing how
students used and sometimes misused the system. Building on these findings, we
propose a new taxonomy of content-based risks for minors and introduce
MinorBench, an open-source benchmark designed to evaluate LLMs on their ability
to refuse unsafe or inappropriate queries from children. We evaluate six
prominent LLMs under different system prompts, demonstrating substantial
variability in their child-safety compliance. Our results inform practical
steps for more robust, child-focused safety mechanisms and underscore the
urgency of tailoring AI systems to safeguard young users.
☆ Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout
Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from
general language comprehension to task-specific expertise. To preserve user
data privacy, federated fine-tuning is often employed and has emerged as the de
facto paradigm. However, federated fine-tuning is prohibitively inefficient due
to the tension between LLM complexity and the resource constraint of end
devices, incurring unaffordable fine-tuning overhead. Existing literature
primarily utilizes parameter-efficient fine-tuning techniques to mitigate
communication costs, yet computational and memory burdens continue to pose
significant challenges for developers. This work proposes DropPEFT, an
innovative federated PEFT framework that employs a novel stochastic transformer
layer dropout method, enabling devices to deactivate a considerable fraction of
LLM layers during training, thereby eliminating the associated computational
load and memory footprint. In DropPEFT, a key challenge is the proper
configuration of dropout ratios for layers, as overhead and training
performance are highly sensitive to this setting. To address this challenge, we
adaptively assign optimal dropout-ratio configurations to devices through an
exploration-exploitation strategy, achieving efficient and effective
fine-tuning. Extensive experiments show that DropPEFT can achieve a
1.3-6.3× speedup in model convergence and a 40%-67% reduction in memory
footprint compared to state-of-the-art methods.
comment: 13 pages
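The stochastic layer dropout idea described in this abstract can be illustrated with a minimal Python sketch. This is not DropPEFT's implementation: the function name `forward_with_layer_dropout`, the per-layer `drop_probs` list, and the toy scalar "layers" are all hypothetical. The sketch only shows the general mechanism, in which a deactivated layer is bypassed as an identity mapping and therefore costs no computation in that training step.

```python
import random
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def forward_with_layer_dropout(
    x: float,
    layers: Sequence[Layer],
    drop_probs: Sequence[float],
    rng: random.Random,
    training: bool = True,
) -> Tuple[float, List[int]]:
    """Run x through the layers, skipping layer i with probability drop_probs[i].

    A skipped layer acts as the identity, so its computation is avoided.
    Returns the output and the indices of the layers that actually ran.
    """
    if len(layers) != len(drop_probs):
        raise ValueError("need one dropout probability per layer")
    active: List[int] = []
    for i, (layer, p) in enumerate(zip(layers, drop_probs)):
        # Layers are only dropped during training (assumption of this sketch).
        if training and rng.random() < p:
            continue
        x = layer(x)
        active.append(i)
    return x, active


# Toy stand-ins for transformer layers (hypothetical).
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
```

In this sketch, the dropout-ratio configuration is just the `drop_probs` list; how such ratios are chosen per device (the exploration-exploitation strategy mentioned in the abstract) is outside its scope.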
☆ Adaptive Preference Aggregation
AI alignment, the challenge of ensuring AI systems act in accordance with
human values, has emerged as a critical problem in the development of systems
such as foundation models and recommender systems. Still, the current dominant
approach, reinforcement learning with human feedback (RLHF), faces known
theoretical limitations in aggregating diverse human preferences. Social choice
theory provides a framework to aggregate preferences, but was not developed for
the multidimensional applications typical of AI. Leveraging insights from a
recently published urn process, this work introduces a preference aggregation
strategy that adapts to the user's context and that inherits the good
properties of the maximal lottery, a Condorcet-consistent solution concept.
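As background for the term "Condorcet-consistent": an aggregation rule has this property if it selects the Condorcet winner (the alternative that beats every other alternative in a pairwise majority vote) whenever one exists. The Python sketch below only checks for such a winner from ranked ballots. It does not implement the urn process or the maximal lottery from the paper, and the function name `condorcet_winner` and the ballot format are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def condorcet_winner(ballots: Sequence[Sequence[Hashable]]) -> Optional[Hashable]:
    """Return the candidate that wins every pairwise majority contest, or None.

    Each ballot is a complete ranking of the same candidates, most preferred first.
    """
    if not ballots:
        return None
    candidates = list(ballots[0])
    # Per-ballot position lookup: a lower index means more preferred.
    ranks: List[Dict[Hashable, int]] = [
        {c: i for i, c in enumerate(ballot)} for ballot in ballots
    ]
    for a in candidates:
        beats_all = True
        for b in candidates:
            if a == b:
                continue
            a_over_b = sum(1 for r in ranks if r[a] < r[b])
            # A strict majority is required to win the pairwise contest.
            if 2 * a_over_b <= len(ballots):
                beats_all = False
                break
        if beats_all:
            return a
    return None
```

When preferences form a majority cycle, no Condorcet winner exists and the function returns `None`; this is the situation in which a randomized solution concept such as the maximal lottery becomes relevant.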
☆ Deep Learning for Time Series Forecasting: A Survey
Xiangjie Kong, Zhenghao Chen, Weiyao Liu, Kaili Ning, Lechao Zhang, Syauqie Muhammad Marier, Yichen Liu, Yuhao Chen, Feng Xia
Time series forecasting (TSF) has long been a crucial task in both industry
and daily life. Most classical statistical models may have certain limitations
when applied to practical scenarios in fields such as energy, healthcare,
traffic, meteorology, and economics, especially when high accuracy is required.
With the continuous development of deep learning, numerous new models have
emerged in the field of time series forecasting in recent years. However,
existing surveys have not provided a unified summary of the wide range of model
architectures in this field, nor have they given detailed summaries of works in
feature extraction and datasets. To address this gap, in this review, we
comprehensively study the previous works and summarize the general paradigms of
Deep Time Series Forecasting (DTSF) in terms of model architectures. Besides,
we take an innovative approach by focusing on the composition of time series
and systematically explain important feature extraction methods. Additionally,
we provide an overall compilation of datasets from various domains in existing
works. Finally, we systematically emphasize the significant challenges faced
and future research directions in this field.
☆ Predicting Chemical Reaction Outcomes Based on Electron Movements Using Machine Learning
Accurately predicting chemical reaction outcomes and potential byproducts is
a fundamental task of modern chemistry, enabling the efficient design of
synthetic pathways and driving progress in chemical science. Reaction
mechanism, which tracks electron movements during chemical reactions, is
critical for understanding reaction kinetics and identifying unexpected
products. Here, we present Reactron, the first electron-based machine learning
model for general reaction prediction. Reactron integrates electron movement
into its predictions, generating detailed arrow-pushing diagrams that elucidate
each mechanistic step leading to product formation. We demonstrate the high
predictive performance of Reactron over existing product-only models on a
large-scale reaction outcome prediction benchmark, and the adaptability of the
model to learn new reactivity upon providing a few examples. Furthermore, it
explores combinatorial reaction spaces, uncovering novel reactivities beyond
its training data. With robust performance in both in- and out-of-distribution
predictions, Reactron embodies human-like reasoning in chemistry and opens new
frontiers in reaction discovery and synthesis design.
comment: 15 pages, 3 figures
☆ Robustness Tokens: Towards Adversarial Robustness of Transformers ECCV
Recently, large pre-trained foundation models have become widely adopted by
machine learning practitioners for a multitude of tasks. Given that such models
are publicly available, relying on their use as backbone models for downstream
tasks might result in high vulnerability to adversarial attacks crafted with
the same public model. In this work, we propose Robustness Tokens, a novel
approach specific to the transformer architecture that fine-tunes a few
additional private tokens with low computational requirements instead of tuning
model parameters as done in traditional adversarial training. We show that
Robustness Tokens make Vision Transformer models significantly more robust to
white-box adversarial attacks while also retaining the original downstream
performances.
comment: This paper has been accepted for publication at the European
Conference on Computer Vision (ECCV), 2024
☆ Multi-Agent Q-Learning Dynamics in Random Networks: Convergence due to Exploration and Sparsity
Beyond specific settings, many multi-agent learning algorithms fail to
converge to an equilibrium solution, and instead display complex,
non-stationary behaviours such as recurrent or chaotic orbits. In fact, recent
literature suggests that such complex behaviours are likely to occur when the
number of agents increases. In this paper, we study Q-learning dynamics in
network polymatrix games where the network structure is drawn from classical
random graph models. In particular, we focus on the Erdos-Renyi model, a
well-studied model for social networks, and the Stochastic Block model, which
generalizes the above by accounting for community structures within the
network. In each setting, we establish sufficient conditions under which the
agents' joint strategies converge to a unique equilibrium. We investigate how
this condition depends on the exploration rates, payoff matrices and,
crucially, the sparsity of the network. Finally, we validate our theoretical
findings through numerical simulations and demonstrate that convergence can be
reliably achieved in many-agent systems, provided network sparsity is
controlled.
☆ Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding
Existing vision-language models (VLMs) often suffer from visual
hallucination, where the generated responses contain inaccuracies that are not
grounded in the visual input. Efforts to address this issue without model
finetuning primarily mitigate hallucination by reducing biases contrastively or
amplifying the weights of visual embedding during decoding. However, these
approaches improve visual perception at the cost of impairing the language
reasoning capability. In this work, we propose the Perception Magnifier (PM), a
novel visual decoding method that iteratively isolates relevant visual tokens
based on attention and magnifies the corresponding regions, spurring the model
to concentrate on fine-grained visual details during decoding. Specifically, by
magnifying critical regions while preserving the structural and contextual
information at each decoding step, PM allows the VLM to enhance its scrutiny of
the visual input, hence producing more accurate and faithful responses.
Extensive experimental results demonstrate that PM not only achieves superior
hallucination mitigation but also enhances language generation while preserving
strong reasoning capabilities. Code is available at
https://github.com/ShunqiM/PM.
comment: 19 pages, 5 figures, 9 tables
☆ ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning WWW 2025
With the proliferation of images in online content, language-guided image
retrieval (LGIR) has emerged as a research hotspot over the past decade,
encompassing a variety of subtasks with diverse input forms. While the
development of large multimodal models (LMMs) has significantly facilitated
these tasks, existing approaches often address them in isolation, requiring the
construction of separate systems for each task. This not only increases system
complexity and maintenance costs, but also exacerbates challenges stemming from
language ambiguity and complex image content, making it difficult for retrieval
systems to provide accurate and reliable results. To this end, we propose
ImageScope, a training-free, three-stage framework that leverages collective
reasoning to unify LGIR tasks. The key insight behind the unification lies in
the compositional nature of language, which transforms diverse LGIR tasks into
a generalized text-to-image retrieval process, along with the reasoning of LMMs
serving as a universal verification to refine the results. To be specific, in
the first stage, we improve the robustness of the framework by synthesizing
search intents across varying levels of semantic granularity using
chain-of-thought (CoT) reasoning. In the second and third stages, we then
reflect on retrieval results by verifying predicate propositions locally, and
performing pairwise evaluations globally. Experiments conducted on six LGIR
datasets demonstrate that ImageScope outperforms competitive baselines.
Comprehensive evaluations and ablation studies further confirm the
effectiveness of our design.
comment: WWW 2025
☆ Retrieval-Augmented Generation with Hierarchical Knowledge
Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly
enhanced the performance of large language models (LLMs) in domain-specific
tasks. However, existing RAG methods do not adequately utilize the naturally
inherent hierarchical knowledge in human cognition, which limits the
capabilities of RAG systems. In this paper, we introduce a new RAG approach,
called HiRAG, which utilizes hierarchical knowledge to enhance the semantic
understanding and structure capturing capabilities of RAG systems in the
indexing and retrieval processes. Our extensive experiments demonstrate that
HiRAG achieves significant performance improvements over the state-of-the-art
baseline methods. The code of our proposed method is available at
https://github.com/hhy-huang/HiRAG.
☆ Multiplicative Learning
Efficient training of artificial neural networks remains a key challenge in
deep learning. Backpropagation (BP), the standard learning algorithm, relies on
gradient descent and typically requires numerous iterations for convergence. In
this study, we introduce Expectation Reflection (ER), a novel learning approach
that updates weights multiplicatively based on the ratio of observed to
predicted outputs. Unlike traditional methods, ER maintains consistency without
requiring ad hoc loss functions or learning rate hyperparameters. We extend ER
to multilayer networks and demonstrate its effectiveness in performing image
classification tasks. Notably, ER achieves optimal weight updates in a single
iteration. Additionally, we reinterpret ER as a modified form of gradient
descent incorporating the inverse mapping of target propagation. These findings
suggest that ER provides an efficient and scalable alternative for training
neural networks.
☆ Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Speculative decoding (SPD) aims to accelerate the auto-regressive token
generation process of a target Large Language Model (LLM). Some approaches
employ a draft model with multiple heads to predict a sequence of future
tokens, where each head handles a token in the sequence. The target LLM
verifies the predicted sequence and accepts aligned tokens, enabling efficient
multi-token generation. However, existing methods assume that all tokens within
a sequence are equally important, employing identical head structures and
relying on a single-generation paradigm, either serial or parallel. In
contrast, we theoretically demonstrate that initial tokens in the draft sequence are
more important than later ones. Building on this insight, we propose Gumiho, a
hybrid model combining serial and parallel heads. Specifically, given the
critical importance of early tokens, we employ a sophisticated Transformer
architecture for the early draft heads in a serial configuration to improve
accuracy. For later tokens, we utilize multiple lightweight MLP heads operating
in parallel to enhance efficiency. By allocating more advanced model structures
and longer running times to the early heads, Gumiho achieves improved overall
performance. The experimental results demonstrate that our method outperforms
existing approaches, fully validating its effectiveness.
comment: Paper under review
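The verify-and-accept step of speculative decoding mentioned in this abstract can be sketched in a few lines of Python. This is a simplified greedy-matching variant written for illustration, not Gumiho's architecture: `verify_draft` and `target_next` are hypothetical names, the target model is stood in for by a plain function, and the target is called sequentially here although a real system would typically score all draft positions in a single parallel pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens produced by one verification round.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch, the target's own token is emitted instead and the
    remaining draft tokens are discarded. If every draft token matches, one
    extra target token is appended.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Toy target model (hypothetical): predicts the last token plus one, mod 10."""
    return (context[-1] + 1) % 10
```

Under this acceptance rule, a mismatch at an early position throws away every later draft token, which gives some intuition for why the accuracy of early draft tokens matters more than that of later ones.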
☆ Deep Learning-Based Direct Leaf Area Estimation using Two RGBD Datasets for Model Development
Estimation of a single leaf area can be a measure of crop growth and a
phenotypic trait to breed new varieties. It has also been used to measure leaf
area index and total leaf area. Some studies have used hand-held cameras, image
processing, 3D reconstruction, and unsupervised learning-based methods to
estimate the leaf area in plant images. Deep learning works well for object
detection and segmentation tasks; however, direct area estimation of objects
has not been explored. This work investigates deep learning-based leaf area
estimation for RGBD images taken using a mobile camera setup in real-world
scenarios. A dataset for attached leaves captured with a top angle view and a
dataset for detached single leaves were collected for model development and
testing. First, image processing-based area estimation was tested on manually
segmented leaves. Then a Mask R-CNN-based model was investigated, and modified
to accept RGBD images and to estimate the leaf area. The detached-leaf dataset
was then mixed with the attached-leaf plant dataset to estimate the single
leaf area for plant images, and another network design with two backbones was
proposed: one for segmentation and the other for area estimation. Instead of
trying all possibilities or random values, an agile approach was used in
hyperparameter tuning. The final model was cross-validated with 5 folds and
tested with two unseen datasets: detached and attached leaves. The F1 score
with 90% IoA for segmentation result on unseen detached-leaf data was 1.0,
while R-squared of area estimation was 0.81. For unseen plant data
segmentation, the F1 score with 90% IoA was 0.59, while the R-squared score was
0.57. The research suggests using attached leaves with ground truth area to
improve the results.
☆ StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Evaluating mathematical capabilities is critical for assessing the overall
performance of large language models (LLMs). However, existing evaluation
methods often focus solely on final answers, resulting in highly inaccurate and
uninterpretable evaluation outcomes, as well as their failure to assess proof
or open-ended problems. To address these issues, we propose a novel
mathematical process evaluation agent based on Tree-of-Error, called
StepMathAgent. This agent incorporates four internal core operations: logical
step segmentation, step scoring, score aggregation and error tree generation,
along with four external extension modules: difficulty calibration, simplicity
evaluation, completeness validation and format assessment. Furthermore, we
introduce StepMathBench, a benchmark comprising 1,000 step-divided process
evaluation instances, derived from 200 high-quality math problems grouped by
problem type, subject category and difficulty level. Experiments on
StepMathBench show that our proposed StepMathAgent outperforms all
state-of-the-art methods, demonstrating human-aligned evaluation preferences
and broad applicability to various scenarios. Our data and code are available
at https://github.com/SHU-XUN/StepMathAgent.
☆ Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text
Large Language Models (LLMs) have demonstrated potential in predicting mental
health outcomes from online text, yet traditional classification methods often
lack interpretability and robustness. This study evaluates structured reasoning
techniques, namely Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and
Tree-of-Thought (ToT), to improve classification accuracy across multiple mental
health datasets sourced from Reddit. We analyze reasoning-driven prompting
strategies, including Zero-shot CoT and Few-shot CoT, using key performance
metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our
findings indicate that reasoning-enhanced techniques improve classification
performance over direct prediction, particularly in complex cases. Compared to
baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained
transformers such as BERT and Mental-RoBerta, and fine-tuned open-source LLMs
such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable
gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and
SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines in
Depression Severity and CSSRS predictions suggest dataset-specific
limitations, likely due to our using a more extensive test set. Among prompting
strategies, Few-shot CoT consistently outperforms others, reinforcing the
effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability
highlights challenges in model reliability and interpretability. This study
provides a comprehensive benchmark of reasoning-based LLM techniques for mental
health text classification. It offers insights into their potential for
scalable clinical applications while identifying key challenges for future
improvements.
comment: 8 pages, 4 Figures, 3 tables
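For readers unfamiliar with the metrics named in this abstract, the following Python sketch computes sensitivity, specificity, balanced accuracy, and F1 for a binary task from their standard definitions. The function name `binary_metrics` and the 0/1 label encoding are assumptions for illustration; the sketch is not taken from the paper's evaluation code.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute standard binary classification metrics for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against empty classes; 0.0 is a convention chosen for this sketch.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy is the mean of the two per-class recall values.
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }
```

Balanced accuracy weights both classes equally regardless of their frequency, which is why it is often preferred over plain accuracy on imbalanced datasets.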
☆ Semantic Synergy: Unlocking Policy Insights and Learning Pathways Through Advanced Skill Mapping
This research introduces a comprehensive system based on state-of-the-art
natural language processing, semantic embedding, and efficient search
techniques for retrieving similarities and thus generating actionable insights
from raw textual information. The system automatically extracts and aggregates
normalized competencies from multiple documents (such as policy files and
curricula vitae) and creates strong relationships between recognized
competencies, occupation profiles, and related learning courses. To validate
its performance, we conducted a multi-tier evaluation that included both
explicit and implicit skill references in synthetic and real-world documents.
The results showed near-human-level accuracy, with F1 scores exceeding 0.95 for
explicit skill detection and above 0.93 for implicit mentions. The system
thereby establishes a sound foundation for supporting in-depth collaboration
across the AE4RIA network. The methodology involves a multi-stage pipeline
based on extensive preprocessing and data cleaning, semantic embedding and
segmentation via SentenceTransformer, and skill extraction using a FAISS-based
search method. The extracted skills are associated with occupation frameworks
(as formulated in the ESCO ontology) and with learning paths offered through
the Sustainable Development Goals Academy. Moreover, interactive visualization
software, implemented with Dash and Plotly, presents graphs and tables for
real-time exploration and informed decision-making by those involved in
policymaking, training and learning supply, career transitions, and
recruitment. Overall, this system, backed by rigorous validation, offers
promising prospects for improved policymaking, human resource development, and
lifelong learning by providing structured and actionable insights from raw,
complex textual information.
☆ Parallelizing Multi-objective A* Search
The Multi-objective Shortest Path (MOSP) problem is a classic network
optimization problem that aims to find all Pareto-optimal paths between two
points in a graph with multiple edge costs. Recent studies on multi-objective
search with A* (MOA*) have demonstrated superior performance in solving
difficult MOSP instances. This paper presents a novel search framework that
allows efficient parallelization of MOA* with different objective orders. The
framework incorporates a unique upper bounding strategy that helps the search
reduce the problem's dimensionality to one in certain cases. Experimental
results demonstrate that the proposed framework can enhance the performance of
recent A*-based solutions, with the speed-up proportional to the problem
dimension.
comment: 8 pages, 2 tables, 2 figures
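The notion of Pareto-optimal paths used in this abstract rests on cost-vector dominance: one path dominates another if it is no worse in every objective and strictly better in at least one. The Python sketch below filters a set of cost vectors down to the non-dominated ones. It illustrates the definition only and is not the MOA* search or the parallel framework from the paper; the names `dominates` and `pareto_front` are chosen for this example.

```python
from typing import List, Sequence, Tuple

Cost = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    if len(a) != len(b):
        raise ValueError("cost vectors must have the same dimension")
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs: Sequence[Sequence[float]]) -> List[Cost]:
    """Return the non-dominated cost vectors, without duplicates."""
    front: List[Cost] = []
    for raw in costs:
        c = tuple(raw)
        # Skip vectors that are duplicates or dominated by a kept vector.
        if c in front or any(dominates(f, c) for f in front):
            continue
        # Drop previously kept vectors that the new one dominates.
        front = [f for f in front if not dominates(c, f)]
        front.append(c)
    return front
```

For example, with two objectives, `(3, 3)` is dominated by `(2, 2)`, while `(2, 2)` and `(5, 1)` are mutually non-dominated and both remain on the front.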
☆ Advanced Tool Learning and Selection System (ATLASS): A Closed-Loop Framework Using LLM
Mohd Ariful Haque, Justin Williams, Sunzida Siddique, Md. Hujaifa Islam, Hasmot Ali, Kishor Datta Gupta, Roy George
The combination of LLM agents with external tools enables models to solve
complex tasks beyond their knowledge base. Human-designed tools are inflexible
and restricted to solutions within the scope of pre-existing tools created by
experts. To address this problem, we propose ATLASS, an advanced tool learning
and selection system designed as a closed-loop framework. It enables the LLM to
solve problems by dynamically generating external tools on demand. In this
framework, agents play a crucial role in orchestrating tool selection,
execution, and refinement, ensuring adaptive problem-solving capabilities. The
operation of ATLASS follows three phases: The first phase, Understanding Tool
Requirements, involves the Agents determining whether tools are required and
specifying their functionality; the second phase, Tool Retrieval/Generation,
involves the Agents retrieving or generating tools based on their availability;
and the third phase, Task Solving, involves combining all the component tools
necessary to complete the initial task. The Tool Dataset stores the generated
tools, ensuring reusability and minimizing inference cost. Current LLM-based
tool generation systems have difficulty creating complex tools that need APIs
or external packages. In ATLASS, we solve the problem by automatically setting
up the environment, fetching relevant API documentation online, and using a
Python interpreter to create a reliable, versatile tool that works in a wider
range of situations. OpenAI GPT-4.0 is used as the LLM agent, and safety and
ethical concerns are handled through human feedback before executing generated
code. By addressing the limitations of predefined toolsets and enhancing
adaptability, ATLASS serves as a real-world solution that empowers users with
dynamically generated tools for complex problem-solving.
☆ AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI
Navigation and manipulation in open-world environments remain unsolved
challenges in Embodied AI. The high cost of commercial mobile manipulation
robots significantly limits research in real-world scenes. To address this
issue, we propose AhaRobot, a low-cost and fully open-source dual-arm mobile
manipulation robot system with a hardware cost of only $1,000 (excluding
optional computational resources), which is less than 1/15 of the cost of
popular mobile robots. The AhaRobot system consists of three components: (1) a
novel low-cost hardware architecture primarily composed of off-the-shelf
components, (2) an optimized control solution to enhance operational precision
integrating dual-motor backlash control and static friction compensation, and
(3) a simple remote teleoperation method, RoboPilot. We use handles to control
the dual arms and pedals for whole-body movement. The teleoperation process is
low-burden and easy to operate, much like piloting. RoboPilot is designed for
remote data collection in embodied scenarios. Experimental results demonstrate
that RoboPilot significantly enhances data collection efficiency in complex
manipulation tasks, achieving a 30% increase compared to methods using 3D mouse
and leader-follower systems. It also excels at completing extremely
long-horizon tasks in one go. Furthermore, AhaRobot can be used to learn
end-to-end policies and autonomously perform complex manipulation tasks, such
as pen insertion and cleaning up the floor. We aim to build an affordable yet
powerful platform to promote the development of embodied tasks on real devices,
advancing more robust and reliable embodied AI. All hardware and software
systems are available at https://aha-robot.github.io.
comment: The first two authors contributed equally. Website:
https://aha-robot.github.io
☆ Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Scaling laws are a critical component of the LLM development pipeline, most
famously as a way to forecast training decisions such as 'compute-optimally'
trading off parameter count and dataset size, alongside a more recent, growing
list of other crucial decisions. In this work, we ask whether compute-optimal
scaling behaviour can be skill-dependent. In particular, we examine knowledge
and reasoning-based skills such as knowledge-based QA and code generation, and
we answer this question in the affirmative: scaling laws are
skill-dependent. Next, to understand whether skill-dependent scaling is an
artefact of the pretraining datamix, we conduct an extensive ablation of
different datamixes and find that, even when correcting for datamix
differences, knowledge and code exhibit fundamental differences in
scaling behaviour. We conclude with an analysis of how our findings relate to
standard compute-optimal scaling using a validation set, and find that
a misspecified validation set can impact compute-optimal parameter
count by nearly 50%, depending on its skill composition.
☆ Deep Learning Approaches for Anti-Money Laundering on Mobile Transactions: Review, Framework, and Directions
Jiani Fan, Lwin Khin Shar, Ruichen Zhang, Ziyao Liu, Wenzhuo Yang, Dusit Niyato, Bomin Mao, Kwok-Yan Lam
Money laundering is a financial crime that obscures the origin of illicit
funds, necessitating the development and enforcement of anti-money laundering
(AML) policies by governments and organizations. The proliferation of mobile
payment platforms and smart IoT devices has significantly complicated AML
investigations. As payment networks become more interconnected, there is an
increasing need for efficient real-time detection to process large volumes of
transaction data on heterogeneous payment systems by different operators such
as digital currencies, cryptocurrencies and account-based payments. Most of
these mobile payment networks are supported by connected devices, many of which
are considered IoT devices in the FinTech space that constantly generate data.
Furthermore, the growing complexity and unpredictability of transaction
patterns across these networks contribute to a higher incidence of false
positives. While machine learning solutions have the potential to enhance
detection efficiency, their application in AML faces unique challenges, such as
addressing privacy concerns tied to sensitive financial data and managing the
real-world constraint of limited data availability due to data regulations.
Existing surveys in the AML literature broadly review machine learning
approaches for money laundering detection, but they often lack an in-depth
exploration of advanced deep learning techniques - an emerging field with
significant potential. To address this gap, this paper conducts a comprehensive
review of deep learning solutions and the challenges associated with their use
in AML. Additionally, we propose a novel framework that applies the
least-privilege principle by integrating machine learning techniques, codifying
AML red flags, and employing account profiling to provide context for
predictions and enable effective fraud detection under limited data
availability....
☆ DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks WACV
Spiking Neural Networks (SNNs) present a more energy-efficient alternative to
Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and
event-driven spikes. Effective utilization of temporal information is crucial
for SNNs, leading to the exploration of attention mechanisms to enhance this
capability. Conventional attention operations apply either identical or
non-identical operations across target dimensions. We identify that
these approaches provide distinct perspectives on temporal information. To
leverage the strengths of both operations, we propose a novel Dual
Temporal-channel-wise Attention (DTA) mechanism that integrates both
identical/non-identical attention strategies. To the best of our knowledge,
this is the first attempt to concentrate on both the correlation and dependency
of temporal-channel using both identical and non-identical attention
operations. Experimental results demonstrate that the DTA mechanism achieves
state-of-the-art performance on both static datasets (CIFAR10, CIFAR100,
ImageNet-1k) and a dynamic dataset (CIFAR10-DVS), elevating spike representation
and capturing complex temporal-channel relationships. We open-source our code:
https://github.com/MnJnKIM/DTA-SNN.
comment: Accepted by IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV) 2025
☆ Rapid analysis of point-contact Andreev reflection spectra via machine learning with adaptive data augmentation
Delineating the superconducting order parameters is a pivotal task in
investigating superconductivity for probing pairing mechanisms, as well as
their symmetry and topology. Point-contact Andreev reflection (PCAR)
measurement is a simple yet powerful tool for identifying the order parameters.
The PCAR spectra exhibit significant variations depending on the type of the
order parameter in a superconductor, including its magnitude
($\mathit{\Delta}$), as well as temperature, interfacial quality, Fermi
velocity mismatch, and other factors. The information on the order parameter
can be obtained by finding the combination of these parameters, generating a
theoretical spectrum that fits a measured experimental spectrum. However, due
to the complexity of the spectra and the high dimensionality of parameters,
extracting the fitting parameters is often time-consuming and labor-intensive.
In this study, we employ a convolutional neural network (CNN) algorithm to
create models for rapid and automated analysis of PCAR spectra of various
superconductors with different pairing symmetries (conventional $s$-wave,
chiral $p_x+ip_y$-wave, and $d_{x^2-y^2}$-wave). The training datasets are
generated based on the Blonder-Tinkham-Klapwijk (BTK) theory and further
modified and augmented by selectively incorporating noise and peaks according
to the bias voltages. This approach not only replicates the experimental
spectra but also brings the model's attention to important features within the
spectra. The optimized models provide fitting parameters for experimentally
measured spectra in less than 100 ms per spectrum. Our approaches and findings
pave the way for rapid and automated spectral analysis which will help
accelerate research on superconductors with complex order parameters.
comment: 18 pages, 3 figures
☆ OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model
Operations Research (OR) has been widely applied in various fields such as
resource allocation, production planning, and supply chain management. However,
addressing real-world OR problems requires OR experts to perform mathematical
modeling and programmers to develop solution algorithms. This traditional
method, heavily reliant on experts, is costly and has long development cycles,
severely limiting the widespread adoption of OR techniques. Few have considered
using Artificial Intelligence (AI) to replace professionals to achieve fully
automated solutions for OR problems. We propose OR-LLM-Agent, the first AI
agent that enables end-to-end automation for solving real-world OR problems.
OR-LLM-Agent leverages the Chain-of-Thought (CoT) reasoning capabilities of
Large Language Models (LLMs) to translate natural language problem descriptions
into formal mathematical models and automatically generate Gurobi solver code.
In OR-LLM-Agent, OR-CodeAgent is designed to automate code execution and repair
within a sandbox environment, facilitating the derivation of the final
solution. Due to the lack of dedicated benchmark datasets for evaluating the
automated solving of OR problems, we construct a benchmark dataset comprising
83 real-world OR problems described in natural language. We conduct comparative
experiments with state-of-the-art (SOTA) reasoning LLMs, including GPT-o3-mini,
DeepSeek-R1, and Gemini 2.0 Flash Thinking. The OR-LLM-Agent achieved the
highest pass rate of 100% and the highest solution accuracy of 85%,
demonstrating the feasibility of automated OR problem-solving. Data and code
are publicly available at https://github.com/bwz96sco/or_llm_agent.
comment: 11 pages, 6 figures
☆ A New Benchmark for Few-Shot Class-Incremental Learning: Redefining the Upper Bound
Class-incremental learning (CIL) aims to continuously adapt to emerging
classes while retaining knowledge of previously learned ones. Few-shot
class-incremental learning (FSCIL) presents an even greater challenge which
requires the model to learn incremental classes with only a limited number of
samples. In conventional CIL, joint training is widely considered the upper
bound, serving as both a benchmark and a methodological guide. However, we find
that joint training fails to be a meaningful upper bound in FSCIL due to the
inherent difficulty of inter-task class separation (ICS) caused by severe class
imbalance. In this work, we introduce a new joint training benchmark tailored
for FSCIL by integrating imbalance-aware techniques, effectively bridging the
performance gap between base and incremental classes. Furthermore, we point out
inconsistencies in the experimental setup and evaluation of existing FSCIL
methods. To ensure fair comparisons between different FSCIL approaches and
joint training, we standardize training conditions and propose a unified
evaluation protocol that simultaneously considers the validation set and
computational complexity. By establishing a reliable upper bound and a
standardized evaluation framework for FSCIL, our work provides a clear
benchmark and a practical foundation for future research.
☆ Label Unbalance in High-frequency Trading
In financial trading, return prediction is one of the foundations of a
successful trading system. With the rapid development of deep learning in
various areas such as graphical processing and natural language, it has also
demonstrated a significant edge in handling financial data. However, the
success of deep learning relies on a large number of labeled samples, and
labeling each time/event as profitable or unprofitable under transaction
costs, especially in high-frequency trading, suffers from a serious label
imbalance issue. In this paper, we adopt a rigorous end-to-end deep learning
framework with comprehensive label imbalance adjustment methods and succeed in
predicting high-frequency returns in the Chinese futures market. The code for
our method is publicly available at
https://github.com/RS2002/Label-Unbalance-in-High-Frequency-Trading .
comment: Technical Report
☆ Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning
Current Semi-supervised Learning (SSL) adopts the pseudo-labeling strategy
and further filters pseudo-labels based on confidence thresholds. However, this
mechanism has notable drawbacks: 1) setting a reasonable threshold is an open
problem that significantly influences the selection of high-quality
pseudo-labels; and 2) deep models often exhibit the over-confidence phenomenon
which makes the confidence value an unreliable indicator for assessing the
quality of pseudo-labels due to the scarcity of labeled data. In this paper, we
propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of
pseudo-labels for unlabeled samples. We further model the utility of
pseudo-labels as long-tailed weights to avoid the open problem of setting the
threshold. Concretely, the advantage of the long-tailed weights ensures that
even unreliable pseudo-labels still contribute to enhancing the model's
robustness. Besides, UES is lightweight and architecture-agnostic, easily
extending to various computer vision tasks, including classification and
regression. Experimental results demonstrate that combining the proposed method
with DualPose leads to a 3.47% improvement in Percentage of Correct Keypoints
(PCK) on the Sniffing dataset with 100 data points (30 labeled), a 7.29%
improvement in PCK on the FLIC dataset with 100 data points (50 labeled), and a
3.91% improvement in PCK on the LSP dataset with 200 data points (100 labeled).
Furthermore, when combined with FixMatch, the proposed method achieves a 0.2%
accuracy improvement on the CIFAR-10 dataset with 40 labeled data points and a
0.26% accuracy improvement on the CIFAR-100 dataset with 400 labeled data
points.
comment: arXiv admin note: text overlap with arXiv:2408.04150
☆ Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework
Data-driven AI is establishing itself at the center of evidence-based
medicine. However, reports of shortcomings and unexpected behavior are growing
due to AI's reliance on association-based learning. A major reason for this
behavior: latent bias in machine learning datasets can be amplified during
training and/or hidden during testing. We present a data modality-agnostic
auditing framework for generating targeted hypotheses about sources of bias
which we refer to as Generalized Attribute Utility and Detectability-Induced
bias Testing (G-AUDIT) for datasets. Our method examines the relationship
between task-level annotations and data properties including protected
attributes (e.g., race, age, sex) and environment and acquisition
characteristics (e.g., clinical site, imaging protocols). G-AUDIT automatically
quantifies the extent to which the observed data attributes may enable shortcut
learning, or in the case of testing data, hide predictions made based on
spurious associations. We demonstrate the broad applicability and value of our
method by analyzing large-scale medical datasets for three distinct modalities
and learning tasks: skin lesion classification in images, stigmatizing language
classification in Electronic Health Records (EHR), and mortality prediction for
ICU tabular data. In each setting, G-AUDIT successfully identifies subtle
biases commonly overlooked by traditional qualitative methods that focus
primarily on social and ethical objectives, underscoring its practical value in
exposing dataset-level risks and supporting the downstream development of
reliable AI systems. Our method paves the way for achieving deeper
understanding of machine learning datasets throughout the AI development
life-cycle from initial prototyping all the way to regulation, and creates
opportunities to reduce model bias, enabling safer and more trustworthy AI
systems.
☆ Optimizing Fire Safety: Reducing False Alarms Using Advanced Machine Learning Techniques
Muhammad Hassan Jamal, Abdulwahab Alazeb, Shahid Allah Bakhsh, Wadii Boulila, Syed Aziz Shah, Aizaz Ahmad Khattak, Muhammad Shahbaz Khan
Fire safety practices are important to reduce the extent of destruction
caused by fire. While smoke alarms help save lives, firefighters struggle with
the increasing number of false alarms. This paper presents a precise and
efficient weighted ensemble model for decreasing false alarms. It estimates the
density, computes weights according to the high and low-density regions,
forwards the high region weights to KNN and low region weights to XGBoost and
combines the predictions. The proposed model is effective at reducing response
time, increasing fire safety, and minimizing the damage that fires cause. A
specifically designed dataset for smoke detection is utilized to test the
proposed model. In addition, a variety of ML models, such as Logistic
Regression (LR), Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB),
K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Extreme Gradient
Boosting (XGBoost), Adaptive Boosting (ADAB), have also been utilized. To
maximize the use of the smoke detection dataset, all the algorithms utilize the
SMOTE re-sampling technique. After evaluating the assessment criteria, this
paper presents a concise summary of the comprehensive findings obtained by
comparing the outcomes of all models.
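The SMOTE re-sampling step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: it only shows the core idea of SMOTE, which creates a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_like`, the parameter `k`, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic


# Toy minority class: the four corners of the unit square (made-up data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Each synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.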
☆ Exploring Mutual Empowerment Between Wireless Networks and RL-based LLMs: A Survey
Reinforcement learning (RL)-based large language models (LLMs), such as
ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their
exceptional capabilities in natural language processing and multimodal data
understanding. Meanwhile, the rapid expansion of information services has
driven the growing need for intelligent, efficient, and adaptable wireless
networks. Wireless networks require the empowerment of RL-based LLMs while
these models also benefit from wireless networks to broaden their application
scenarios. Specifically, RL-based LLMs can enhance wireless communication
systems through intelligent resource allocation, adaptive network optimization,
and real-time decision-making. Conversely, wireless networks provide a vital
infrastructure for the efficient training, deployment, and distributed
inference of RL-based LLMs, especially in decentralized and edge computing
environments. This mutual empowerment highlights the need for a deeper
exploration of the interplay between these two domains. We first review recent
advancements in wireless communications, highlighting the associated challenges
and potential solutions. We then discuss the progress of RL-based LLMs,
focusing on key technologies for LLM training, challenges, and potential
solutions. Subsequently, we explore the mutual empowerment between these two
fields, highlighting key motivations, open challenges, and potential solutions.
Finally, we provide insights into future directions, applications, and their
societal impact to further explore this intersection, paving the way for
next-generation intelligent communication systems. Overall, this survey
provides a comprehensive overview of the relationship between RL-based LLMs and
wireless networks, offering a vision where these domains empower each other to
drive innovations.
comment: 25 pages, 13 figures
☆ MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation CVPR 2025
In this paper, we address the problem of human trajectory forecasting, which
aims to predict the inherently multi-modal future movements of humans based on
their past trajectories and other contextual cues. We propose a novel motion
prediction conditional flow matching model, termed MoFlow, to predict K-shot
future trajectories for all agents in a given scene. We design a novel flow
matching loss function that not only ensures at least one of the $K$ sets of
future trajectories is accurate but also encourages all $K$ sets of future
trajectories to be diverse and plausible. Furthermore, by leveraging the
implicit maximum likelihood estimation (IMLE), we propose a novel distillation
method for flow models that only requires samples from the teacher model.
Extensive experiments on the real-world datasets, including SportVU NBA games,
ETH-UCY, and SDD, demonstrate that both our teacher flow model and the
IMLE-distilled student model achieve state-of-the-art performance. These models
can generate diverse trajectories that are physically and socially plausible.
Moreover, our one-step student model is $\textbf{100}$ times faster than the
teacher flow model during sampling. The code, model, and data are available at
our project page: https://moflow-imle.github.io
comment: Accepted to CVPR 2025
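The requirement that at least one of the $K$ predicted trajectory sets be accurate is often measured with a best-of-K error. The sketch below is an illustration of that general idea, not MoFlow's loss function: it computes the average displacement error of each candidate and keeps the minimum. The function names and the toy trajectories are assumptions.

```python
import math


def average_displacement(pred, truth):
    """Mean Euclidean distance between predicted and true positions over time."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def best_of_k_error(candidates, truth):
    """Error of the closest candidate among the K predicted trajectories."""
    return min(average_displacement(c, truth) for c in candidates)


# Toy ground truth and K = 2 candidate futures (made-up data).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # drifts away from the truth
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],  # close to the truth
]
best = best_of_k_error(candidates, truth)
```

Taking the minimum rewards a model for covering the true future with any one of its candidates. On its own it does nothing to make the remaining candidates plausible, which is why the abstract also asks for all $K$ sets to be diverse and plausible.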
☆ Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction
Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu
Water quality is foundational to environmental sustainability, ecosystem
resilience, and public health. Deep learning models, particularly Long
Short-Term Memory (LSTM) networks, offer transformative potential for
large-scale water quality prediction and scientific insights generation.
However, their widespread adoption in high-stakes decision-making, such as
pollution mitigation and equitable resource allocation, is prevented by
unresolved trustworthiness challenges including fairness, uncertainty,
interpretability, robustness, generalizability, and reproducibility. In this
work, we present the first comprehensive evaluation of trustworthiness in a
continental-scale multi-task LSTM model predicting 20 water quality variables
(encompassing physical/chemical processes, geochemical weathering, and nutrient
cycling) across 482 U.S. basins. Our investigation uncovers systematic patterns
of model performance disparities linked to basin characteristics, the inherent
complexity of biogeochemical processes, and variable predictability,
emphasizing critical performance fairness concerns. We further propose
methodological frameworks for quantitatively evaluating critical aspects of
trustworthiness, including uncertainty, interpretability, and robustness,
identifying key limitations that could challenge reliable real-world
deployment. This work serves as a timely call to action for advancing
trustworthy data-driven methods for water resources management and provides a
pathway to offering critical insights for researchers, decision-makers, and
practitioners seeking to leverage artificial intelligence (AI) responsibly in
environmental management.
comment: 33 pages, 9 figures, 2 tables
☆ TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness
3D semantic occupancy has rapidly become a research focus in the fields of
robotics and autonomous driving environment perception due to its ability to
provide more realistic geometric perception and its closer integration with
downstream tasks. By performing occupancy prediction of the 3D space in the
environment, the ability and robustness of scene understanding can be
effectively improved. However, existing occupancy prediction tasks are
primarily modeled using voxel or point cloud-based approaches: voxel-based
network structures often suffer from the loss of spatial information due to the
voxelization process, while point cloud-based methods, although better at
retaining spatial location information, face limitations in representing
volumetric structural details. To address this issue, we propose a dual-modal
prediction method based on 3D Gaussian sets and sparse points, which balances
both spatial location and volumetric structural information, achieving higher
accuracy in semantic occupancy prediction. Specifically, our method adopts a
Transformer-based architecture, taking 3D Gaussian sets, sparse points, and
queries as inputs. Through the multi-layer structure of the Transformer, the
enhanced queries and 3D Gaussian sets jointly contribute to the semantic
occupancy prediction, and an adaptive fusion mechanism integrates the semantic
outputs of both modalities to generate the final prediction results.
Additionally, to further improve accuracy, we dynamically refine the point
cloud at each layer, allowing for more precise location information during
occupancy prediction. We conducted experiments on the Occ3D-nuScenes dataset,
and the experimental results demonstrate superior performance of the proposed
method on IoU-based metrics.
☆ Developing and Evaluating an AI-Assisted Prediction Model for Unplanned Intensive Care Admissions following Elective Neurosurgery using Natural Language Processing within an Electronic Healthcare Record System
Julia Ive, Olatomiwa Olukoya, Jonathan P. Funnell, James Booker, Sze H M Lam, Ugan Reddy, Kawsar Noor, Richard JB Dobson, Astri M. V. Luoma, Hani J Marcus
Introduction: Timely care in a specialised neuro-intensive therapy unit (ITU)
reduces mortality and hospital stays, with planned admissions being safer than
unplanned ones. However, post-operative care decisions remain subjective. This
study used artificial intelligence (AI), specifically natural language
processing (NLP), to analyse electronic health records (EHRs) and predict ITU
admissions for elective surgery patients. Methods: This study analysed the EHRs
of elective neurosurgery patients from University College London Hospital
(UCLH) using NLP. Patients were categorised into planned high dependency unit
(HDU) or ITU admission; unplanned HDU or ITU admission; or ward / overnight
recovery (ONR). The Medical Concept Annotation Tool (MedCAT) was used to
identify SNOMED-CT concepts within the clinical notes. We then explored the
utility of these identified concepts for a range of AI algorithms trained to
predict ITU admission. Results: The CogStack-MedCAT NLP model, initially
trained on hospital-wide EHRs, underwent two refinements: first with data from
patients with Normal Pressure Hydrocephalus (NPH) and then with data from
Vestibular Schwannoma (VS) patients, achieving a concept detection F1-score of
0.93. This refined model was then used to extract concepts from EHR notes of
2,268 eligible neurosurgical patients. We integrated the extracted concepts
into AI models, including a decision tree model and a neural time-series model.
Using the simpler decision tree model, we achieved a recall of 0.87 (CI 0.82 -
0.91) for ITU admissions, reducing the proportion of unplanned ITU cases missed
by human experts from 36% to 4%. Conclusion: The NLP model, refined for
accuracy, has proven its efficiency in extracting relevant concepts, providing
a reliable basis for predictive AI models to use in clinically valid
applications.
♻ ☆ Chain-of-Thought Reasoning In The Wild Is Not Always Faithful ICLR 25
Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art
AI capabilities. However, recent studies have shown that CoT reasoning is not
always faithful, i.e. CoT reasoning does not always reflect how models arrive
at conclusions. So far, most of these studies have focused on unfaithfulness in
unnatural contexts where an explicit bias has been introduced. In contrast, we
show that unfaithful CoT can occur on realistic prompts with no artificial
bias. Our results reveal non-negligible rates of several forms of unfaithful
reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and
ChatGPT-4o (7.0%) all answer a notable proportion of question pairs
unfaithfully. Specifically, we find that models rationalize their implicit
biases in answers to binary questions ("implicit post-hoc rationalization").
For example, when separately presented with the questions "Is X bigger than Y?"
and "Is Y bigger than X?", models sometimes produce superficially coherent
arguments to justify answering Yes to both questions or No to both questions,
despite such responses being logically contradictory. We also investigate
restoration errors (Dziri et al., 2023), where models make and then silently
correct errors in their reasoning, and unfaithful shortcuts, where models use
clearly illogical reasoning to simplify solving problems in Putnam questions (a
hard benchmark). Our findings raise challenges for AI safety work that relies
on monitoring CoT to detect undesired behavior.
comment: Accepted to the Reasoning and Planning for Large Language Models
Workshop (ICLR 25), 10 main paper pages, 38 appendix pages
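The paired-question check described in the abstract can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: it assumes X and Y are never equal, that each answer has already been reduced to "Yes" or "No", and that a pair counts as contradictory when both answers agree. The function names and the sample answers are hypothetical.

```python
def normalise(answer):
    """Map a raw answer string to 'yes' or 'no'; reject anything else."""
    a = answer.strip().lower()
    if a not in ("yes", "no"):
        raise ValueError(f"expected Yes or No, got {answer!r}")
    return a


def is_contradictory(answer_forward, answer_reversed):
    """True if "Is X bigger than Y?" and "Is Y bigger than X?" got the same answer.

    Assumes X != Y, so exactly one of the two questions should be answered Yes.
    """
    return normalise(answer_forward) == normalise(answer_reversed)


def contradiction_rate(pairs):
    """Fraction of question pairs answered Yes/Yes or No/No."""
    if not pairs:
        return 0.0
    return sum(is_contradictory(f, r) for f, r in pairs) / len(pairs)


# Made-up answers for four question pairs.
pairs = [("Yes", "No"), ("Yes", "Yes"), ("No", "No"), ("No", "Yes")]
rate = contradiction_rate(pairs)
```

A contradictory pair only shows that at least one of the two answers is wrong. Judging whether the accompanying reasoning is unfaithful requires inspecting the chains of thought themselves, which this sketch does not attempt.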
♻ ☆ Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking ICLR 2025
Because it is difficult to precisely specify complex objectives,
reinforcement learning policies are often optimized using proxy reward
functions that only approximate the true goal. However, optimizing proxy
rewards frequently leads to reward hacking: the optimized reward function
ceases to be a good proxy and the resulting policy performs poorly with respect
to the unspecified true reward. Principled solutions to reward hacking have
been impeded by the lack of a good definition for the problem. To address this
gap, we introduce a definition of reward hacking based on the correlation
between proxy and true rewards for states and actions seen by a "reference
policy" that breaks down under optimization. We show that this definition
captures reward hacking behavior across several realistic settings, including
in reinforcement learning from human feedback (RLHF). Using our formulation, we
show theoretically that regularization to the reference policy can effectively
prevent reward hacking. While the current practice in RLHF applies a KL penalty
between action distributions for this purpose, our theory suggests regularizing
the $\chi^2$ divergence between the policies' occupancy measures can be more
effective. We intuitively show the benefits of this type of regularization and
demonstrate that it better mitigates reward hacking in practice across four
realistic settings, including RLHF. Our code is available at
https://github.com/cassidylaidlaw/orpo.
comment: Spotlight at ICLR 2025
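For readers unfamiliar with the two regularizers named in the abstract, the sketch below compares the KL divergence and the $\chi^2$ divergence on a toy pair of discrete distributions. It is an illustration only: the paper applies the $\chi^2$ divergence to occupancy measures, whereas the vectors here are made-up probabilities, and the argument order (policy first, reference second) is an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2_divergence(p, q):
    """chi^2(p || q) = sum_i (p_i - q_i)^2 / q_i for discrete distributions."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


reference = [0.5, 0.5]   # toy reference distribution
optimised = [0.9, 0.1]   # toy distribution after optimisation

kl = kl_divergence(optimised, reference)
chi2 = chi2_divergence(optimised, reference)
```

In this toy case the $\chi^2$ value (0.64) is larger than the KL value (about 0.37), so the same shift away from the reference is penalised more heavily under $\chi^2$.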
♻ ☆ DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback ICLR 2025
The process of creating training data to teach models is currently driven by
humans, who manually analyze model weaknesses and plan how to create data that
improves a student model. Approaches using LLMs as annotators reduce human
effort, but still require humans to interpret feedback from evaluations and
control the LLM to produce data the student needs. Automating this
labor-intensive process by creating autonomous data generation agents - or
teachers - is desirable, but requires environments that can simulate the
feedback-driven, iterative, closed loop of data creation. To enable rapid,
scalable testing for such agents and their modules, we introduce DataEnvGym, a
testbed of teacher environments for data generation agents. DataEnvGym frames
data generation as a sequential decision-making task, involving an agent
consisting of a data generation policy (which generates a plan for creating
training data) and a data generation engine (which transforms the plan into
data), inside an environment that provides student feedback. The agent's goal
is to improve student performance. Students are iteratively trained and
evaluated on generated data, and their feedback (in the form of errors or weak
skills) is reported to the agent after each iteration. DataEnvGym includes
multiple teacher environment instantiations across 3 levels of structure in the
state representation and action space. More structured environments are based
on inferred skills and offer more interpretability and curriculum control. We
support 4 domains (math, code, VQA, and tool-use) and test multiple students
and teachers. Example agents in our teaching environments can iteratively
improve students across tasks and settings. Moreover, we show that environments
teach different skill levels and test variants of key modules, pointing to
future work in improving data generation agents, engines, and feedback
mechanisms.
comment: ICLR 2025 Spotlight; Project Page: https://DataEnvGym.github.io
♻ ☆ What is the Alignment Objective of GRPO?
In this note, we examine the aggregation of preferences achieved by the Group
Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to
train advanced artificial intelligence models such as DeepSeek-R1-Zero and
DeepSeekMath. The GRPO algorithm trains a policy using a reward preference
model, which is computed by sampling a set of outputs for a given context,
observing the corresponding rewards, and applying shift-and-scale normalisation
to these reward values. Additionally, it incorporates a penalty function to
discourage deviations from a reference policy.
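The shift-and-scale normalisation described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the use of the population standard deviation, the small `eps` constant, and the function name are assumptions.

```python
from statistics import mean, pstdev


def normalise_rewards(rewards, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled outputs of one context (made-up values).
group = [1.0, 0.0, 0.0, 1.0]
advantages = normalise_rewards(group)

# A group of size two with distinct rewards.
pair = normalise_rewards([0.3, 0.9])
```

For a group of size two with distinct rewards, the normalised values are approximately -1 and +1 whatever the gap between the rewards, so only the ordering of the two outputs matters. This is consistent with the pairwise-comparison reading given later in the abstract.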
We present a framework that enables us to characterise the stationary
policies of the GRPO algorithm. This analysis reveals that the aggregation of
preferences differs fundamentally from standard logarithmic pooling, which is
implemented by other approaches such as RLHF. The precise form of preference
aggregation arises from the way the reward preference model is defined and from
the penalty function, which we show to essentially correspond to the reverse
Kullback-Leibler (KL) divergence between the aggregation policy and the
reference policy.
Interestingly, we demonstrate that for groups of size two, the reward
preference model corresponds to pairwise comparison preferences, similar to
those in other alignment methods based on pairwise comparison feedback. We
provide explicit characterisations of the aggregate preference for binary
questions, for groups of size two, and in the limit of large group size. This
provides insights into the dependence of the aggregate preference on parameters
such as the regularisation constant and the confidence margin of question
answers.
Finally, we discuss the aggregation of preferences obtained by modifying the
GRPO algorithm to use direct KL divergence as the penalty or to use rewards
without scale normalisation.
♻ ☆ YouTube Comments Decoded: Leveraging LLMs for Low Resource Language Classification
Sarcasm detection is a significant challenge in sentiment analysis,
particularly due to its nature of conveying opinions where the intended meaning
deviates from the literal expression. This challenge is heightened in social
media contexts where code-mixing, especially in Dravidian languages, is
prevalent. Code-mixing involves the blending of multiple languages within a
single utterance, often with non-native scripts, complicating the task for
systems trained on monolingual data. This shared task introduces a novel gold
standard corpus designed for sarcasm and sentiment detection within code-mixed
texts, specifically in Tamil-English and Malayalam-English languages. The
primary objective of this task is to identify sarcasm and sentiment polarity
within a code-mixed dataset of Tamil-English and Malayalam-English comments and
posts collected from social media platforms. Each comment or post is annotated
at the message level for sentiment polarity, with particular attention to the
challenges posed by class imbalance, reflecting real-world scenarios. In this
work, we experiment with state-of-the-art large language models like GPT-3.5
Turbo via prompting to classify comments into sarcastic or non-sarcastic
categories. We obtained macro-F1 scores of 0.61 for Tamil and 0.50 for
Malayalam.
comment: Updated and Final Version
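For readers unfamiliar with the metric reported above, macro-F1 is the unweighted mean of the per-class F1 scores, so a minority class counts as much as a majority one. The sketch below computes it from scratch; the label names and the toy predictions are hypothetical and are not data from the shared task.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy example with an imbalanced label distribution.
gold = ["sarcastic", "non-sarcastic", "non-sarcastic", "non-sarcastic"]
pred = ["sarcastic", "sarcastic", "non-sarcastic", "non-sarcastic"]
print(macro_f1(gold, pred))
```

In this toy case the per-class F1 scores are 0.8 (non-sarcastic) and 2/3 (sarcastic), so the macro average lies between them.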
♻ ☆ Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity ICLR 2025
Architectures such as Linformer and Mamba have recently emerged as
competitive linear time replacements for transformers. However, corresponding
large pretrained models are often unavailable, especially in non-text domains.
To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD)
approach that jointly converts a transformer model to a linear time substitute
and fine-tunes it to a target task. We also compare several means to guide the
fine-tuning to optimally retain the desired inference capability from the
original model. The methods differ in their use of the target model and the
trajectory of the parameters. In a series of empirical studies on language
processing, language modeling, and speech processing, we show that CALD can
effectively recover the result of the original model, and that the guiding
strategy contributes to the result. Some reasons for the variation are
suggested.
comment: 18 pages, 5 figures; ICLR 2025 camera ready. Code:
https://github.com/idiap/linearize-distill-pretrained-transformers
♻ ☆ Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation ICLR 2025
LLM self-evaluation relies on the LLM's own ability to estimate response
correctness, which can greatly improve its deployment reliability. In this
research track, we propose the Chain-of-Embedding (CoE) in the latent space to
enable LLMs to perform output-free self-evaluation. CoE consists of all
progressive hidden states produced during the inference time, which can be
treated as the latent thinking path of LLMs. We find that when LLMs respond
correctly and incorrectly, their CoE features differ; these discrepancies
help us estimate LLM response correctness. Experiments in four diverse
domains and seven LLMs fully demonstrate the effectiveness of our method.
Meanwhile, its label-free, training-free design and
millisecond-level computational cost ensure real-time feedback in large-scale
scenarios. More importantly, we provide interesting insights into LLM response
correctness from the perspective of hidden state changes inside LLMs.
comment: Accepted by ICLR 2025
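To make the idea of treating progressive hidden states as a trajectory more concrete, here is a rough sketch that summarises a sequence of hidden-state vectors by the mean step length and mean angle between consecutive states. The abstract does not specify which CoE features are used, so both features, the function name `trajectory_features`, and the toy vectors are assumptions for illustration only.

```python
import math


def trajectory_features(hidden_states):
    """Return (mean step length, mean angle in radians) between consecutive states.

    Assumption: every hidden state is a non-zero vector of the same dimension.
    """
    if len(hidden_states) < 2:
        raise ValueError("need at least two hidden states")

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    steps, angles = [], []
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        # Euclidean distance travelled between two consecutive states.
        steps.append(norm([c - p for p, c in zip(prev, curr)]))
        # Angle between the two states, with the cosine clamped for safety.
        cos = sum(p * c for p, c in zip(prev, curr)) / (norm(prev) * norm(curr))
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return sum(steps) / len(steps), sum(angles) / len(angles)


# Toy 2-D "hidden states"; real hidden states would be high-dimensional.
print(trajectory_features([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]))
```

Features of this kind need only a forward pass and no generated output or labels, which is the property the abstract emphasises.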
♻ ☆ When Text Embedding Meets Large Language Model: A Comprehensive Survey
Text embedding has become a foundational technology in natural language
processing (NLP) during the deep learning era, driving advancements across a
wide array of downstream tasks. While many natural language understanding
challenges can now be modeled using generative paradigms and leverage the
robust generative and comprehension capabilities of large language models
(LLMs), numerous practical applications, such as semantic matching, clustering,
and information retrieval, continue to rely on text embeddings for their
efficiency and effectiveness. Therefore, how to combine the LLMs and the text
embeddings has become one of the hotspots of academic attention in recent
years. In this survey, we categorize the interplay between LLMs and text
embeddings into three overarching themes: (1) LLM-augmented text embedding,
enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders,
adapting their innate capabilities for high-quality embedding; and (3) Text
embedding understanding with LLMs, leveraging LLMs to analyze and interpret
embeddings. By organizing recent works based on interaction patterns rather
than specific downstream applications, we offer a novel and systematic overview
of contributions from various research and application domains in the era of
LLMs. Furthermore, we highlight the unresolved challenges that persisted in the
pre-LLM era with pre-trained language models (PLMs) and explore the emerging
obstacles brought forth by LLMs. Building on this analysis, we outline
prospective directions for the evolution of text embedding, addressing both
theoretical and practical opportunities in the rapidly advancing landscape of
NLP.
comment: Work in progress
♻ ☆ Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation
Bhrij Patel, Kasun Weerakoon, Wesley A. Suttle, Alec Koppel, Brian M. Sadler, Tianyi Zhou, Amrit Singh Bedi, Dinesh Manocha
Reinforcement learning (RL) is a promising approach for robotic navigation,
allowing robots to learn through trial and error. However, real-world robotic
tasks often suffer from sparse rewards, leading to inefficient exploration and
suboptimal policies due to sample inefficiency of RL. In this work, we
introduce Confidence-Controlled Exploration (CCE), a novel method that improves
sample efficiency in RL-based robotic navigation without modifying the reward
function. Unlike existing approaches, such as entropy regularization and reward
shaping, which can introduce instability by altering rewards, CCE dynamically
adjusts trajectory length based on policy entropy. Specifically, it shortens
trajectories when uncertainty is high to enhance exploration and extends them
when confidence is high to prioritize exploitation. CCE is a principled and
practical solution inspired by a theoretical connection between policy entropy
and gradient estimation. It integrates seamlessly with on-policy and off-policy
RL methods and requires minimal modifications. We validate CCE across
REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE
outperforms fixed-trajectory and entropy-regularized baselines, achieving an
18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation
costs under a fixed training sample budget. Finally, we deploy CCE on a
Clearpath Husky robot, demonstrating its effectiveness in complex outdoor
environments.
comment: 10 pages, 6 figures, 2 tables
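The rule described in the CCE abstract (shorter trajectories when policy entropy is high, longer ones when it is low) can be sketched as follows. The linear mapping, the bounds `min_len` and `max_len`, and the function names are hypothetical choices for this sketch; the abstract does not give the actual schedule used by the authors.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def trajectory_length(probs, min_len=16, max_len=128):
    """Map normalised policy entropy to a trajectory length (illustrative).

    High entropy (uncertain policy) gives a length near min_len;
    low entropy (confident policy) gives a length near max_len.
    """
    max_entropy = math.log(len(probs))
    uncertainty = policy_entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return round(max_len - uncertainty * (max_len - min_len))


print(trajectory_length([0.25, 0.25, 0.25, 0.25]))  # uniform policy
print(trajectory_length([1.0, 0.0, 0.0, 0.0]))      # deterministic policy
```

Note that the reward function is untouched in this sketch; only the rollout horizon changes, which matches the abstract's claim that CCE does not modify rewards.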
♻ ☆ InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Advanced reasoning in large language models has achieved remarkable
performance on challenging tasks, but the prevailing long-context reasoning
paradigm faces critical limitations: quadratic computational scaling with
sequence length, reasoning constrained by maximum context boundaries, and
performance degradation beyond pre-training context windows. Existing
approaches primarily compress reasoning chains without addressing the
fundamental scaling problem. To overcome these challenges, we introduce
InftyThink, a paradigm that transforms monolithic reasoning into an iterative
process with intermediate summarization. By interleaving short reasoning
segments with concise progress summaries, our approach enables unbounded
reasoning depth while maintaining bounded computational costs. This creates a
characteristic sawtooth memory pattern that significantly reduces computational
complexity compared to traditional approaches. Furthermore, we develop a
methodology for reconstructing long-context reasoning datasets into our
iterative format, transforming OpenR1-Math into 333K training instances.
Experiments across multiple model architectures demonstrate that our approach
reduces computational costs while improving performance, with Qwen2.5-Math-7B
showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
Our work challenges the assumed trade-off between reasoning depth and
computational efficiency, providing a more scalable approach to complex
reasoning without architectural modifications.
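The iterative scheme described in the InftyThink abstract, short reasoning segments interleaved with progress summaries, can be sketched as a simple loop. The callables `reason_step` and `summarise` stand in for model calls and are hypothetical, as are the toy stubs and the use of character counts as a proxy for context length.

```python
def iterative_reasoning(question, reason_step, summarise, max_rounds=8):
    """Alternate reasoning and summarisation, tracking the peak context size."""
    summary = ""
    peak_context = 0
    for _ in range(max_rounds):
        # Each round sees only the question and the running summary.
        context = question + summary
        segment, done = reason_step(context)
        peak_context = max(peak_context, len(context) + len(segment))
        if done:
            return segment, peak_context
        # The full segment is dropped; only the summary is carried forward.
        summary = summarise(summary, segment)
    return summary, peak_context


def toy_reason_step(context):
    """Stub model: emits 50 characters per round, finishes after 3 summaries."""
    if context.count("#") >= 3:
        return "final answer", True
    return "x" * 50, False


def toy_summarise(summary, segment):
    """Stub summariser: compresses each segment to a single marker."""
    return summary + "#"


answer, peak = iterative_reasoning("Q: 2+2?", toy_reason_step, toy_summarise)
print(answer, peak)
```

In this toy run the three 50-character segments would total 150 characters if kept in one context, whereas the peak context in the loop stays below 60, which is the bounded sawtooth pattern the abstract refers to.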
♻ ☆ Fast MRI for All: Bridging Equity Gaps via Training without Raw Data Access
Physics-driven deep learning (PD-DL) approaches have become popular for
improved reconstruction of fast magnetic resonance imaging (MRI) scans. Though
PD-DL offers higher acceleration rates than existing clinical fast MRI
techniques, its use has been limited outside specialized MRI centers. A key
challenge is generalization to underrepresented pathologies or populations,
noted in multiple studies, with fine-tuning on target populations suggested for
improvement. However, current approaches for PD-DL training require access to
raw k-space measurements, which are typically only available at specialized MRI
centers that have research agreements for such data access. This is especially
an issue for rural and underserved areas, where commercial MRI scanners only
provide access to a final reconstructed image. To tackle these challenges, we
propose Compressibility-inspired Unsupervised Learning via Parallel Imaging
Fidelity (CUPID) for high-quality PD-DL training using only routine clinical
reconstructed images exported from an MRI scanner. CUPID evaluates output
quality with a compressibility-based approach while ensuring that the output
stays consistent with the clinical parallel imaging reconstruction through
well-designed perturbations. Our results show CUPID achieves similar quality to
established PD-DL training that requires k-space data while outperforming
compressed sensing (CS) and diffusion-based generative methods. We further
demonstrate its effectiveness in a zero-shot training setup for retrospectively
and prospectively sub-sampled acquisitions, attesting to its minimal training
burden. As an approach that radically deviates from existing strategies, CUPID
presents an opportunity to provide equitable access to fast MRI for underserved
populations in an attempt to reduce the inequalities associated with this
expensive imaging modality.
♻ ☆ DataMan: Data Manager for Pre-training Large Language Models ICLR2025
The performance emergence of large language models (LLMs) driven by data
scaling laws makes the selection of pre-training data increasingly important.
However, existing methods rely on limited heuristics and human intuition,
lacking comprehensive and clear guidelines. To address this, we are inspired by
"reverse thinking" -- prompting LLMs to self-identify which criteria benefit
their performance. As their pre-training capabilities are related to perplexity
(PPL), we derive 14 quality criteria from the causes of text perplexity
anomalies and introduce 15 common application domains to support domain mixing.
In this paper, we train a Data Manager (DataMan) to learn quality ratings and
domain recognition from pointwise rating, and use it to annotate a 447B token
pre-training corpus with 14 quality ratings and domain type. Our experiments
validate our approach, using DataMan to select 30B tokens to train a
1.3B-parameter language model, demonstrating significant improvements in
in-context learning (ICL), perplexity, and instruction-following ability over
the state-of-the-art baseline. The best-performing model, based on the Overall
Score l=5 surpasses a model trained with 50% more data using uniform sampling.
We continue pre-training with high-rated, domain-specific data annotated by
DataMan to enhance domain-specific ICL performance and thus verify DataMan's
domain mixing ability. Our findings emphasize the importance of quality
ranking, the complementary nature of quality criteria, and their low
correlation with perplexity, analyzing misalignment between PPL and ICL
performance. We also thoroughly analyzed our pre-training dataset, examining
its composition, the distribution of quality ratings, and the original document
sources.
comment: ICLR2025 paper
♻ ☆ HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition MICCAI2024
Natural language could play an important role in developing generalist
surgical models by providing a broad source of supervision from raw texts. This
flexible form of supervision can enable the model's transferability across
datasets and tasks as natural language can be used to reference learned visual
concepts or describe new ones. In this work, we present HecVL, a novel
hierarchical video-language pretraining approach for building a generalist
surgical model. Specifically, we construct a hierarchical video-text paired
dataset by pairing the surgical lecture video with three hierarchical levels of
texts: at clip-level, atomic actions using transcribed audio texts; at
phase-level, conceptual text summaries; and at video-level, overall abstract
text of the surgical procedure. Then, we propose a novel fine-to-coarse
contrastive learning framework that learns separate embedding spaces for the
three video-text hierarchies using a single model. By disentangling embedding
spaces of different hierarchical levels, the learned multi-modal
representations encode short-term and long-term surgical concepts in the same
model. Thanks to the injected textual semantics, we demonstrate that the HecVL
approach can enable zero-shot surgical phase recognition without any human
annotation. Furthermore, we show that the same HecVL model for surgical phase
recognition can be transferred across different surgical procedures and medical
centers. The code is available at https://github.com/CAMMA-public/SurgVLP
comment: Accepted by MICCAI2024
♻ ☆ Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation NeurIPS 2024
Surgical video-language pretraining (VLP) faces unique challenges due to the
knowledge domain gap and the scarcity of multi-modal data. This study aims to
bridge the gap by addressing issues regarding textual information loss in
surgical lecture videos and the spatial-temporal challenges of surgical VLP. We
propose a hierarchical knowledge augmentation approach and a novel
Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining
(PeskaVLP) framework to tackle these issues. The knowledge augmentation uses
large language models (LLM) for refining and enriching surgical concepts, thus
providing comprehensive language supervision and reducing the risk of
overfitting. PeskaVLP combines language supervision with visual
self-supervision, constructing hard negative samples and employing a Dynamic
Time Warping (DTW) based loss function to effectively comprehend the
cross-modal procedural alignment. Extensive experiments on multiple public
surgical scene understanding and cross-modal retrieval datasets show that our
proposed method significantly improves zero-shot transferring performance and
offers a generalist visual representation for further advancements in surgical
scene understanding. The code is available at
https://github.com/CAMMA-public/SurgVLP
comment: Accepted at the 38th Conference on Neural Information Processing
Systems (NeurIPS 2024 Spotlight)
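The PeskaVLP abstract above mentions a Dynamic Time Warping (DTW) based loss. As background, the sketch below implements the classic DTW distance between two scalar sequences with dynamic programming. It is not the paper's loss, which operates on video and text representations; the function name and the absolute-difference cost are assumptions for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: skip in a, skip in b, or match both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same "procedure" executed at different speeds aligns with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW tolerates local stretching in time while preserving order, it is a natural fit for comparing procedural sequences whose steps take varying durations.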
♻ ☆ COMBO: Compositional World Models for Embodied Multi-Agent Cooperation ICLR 2025
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Behzad Dariush, Kwonjoon Lee, Yilun Du, Chuang Gan
In this paper, we investigate the problem of embodied multi-agent
cooperation, where decentralized agents must cooperate given only egocentric
views of the world. To effectively plan in this setting, in contrast to
learning world dynamics in a single-agent scenario, we must simulate world
dynamics conditioned on an arbitrary number of agents' actions given only
partial egocentric visual observations of the world. To address this issue of
partial observability, we first train generative models to estimate the overall
world state given partial egocentric observations. To enable accurate
simulation of multiple sets of actions on this world state, we then propose to
learn a compositional world model for multi-agent cooperation by factorizing
the naturally composable joint actions of multiple agents and compositionally
generating the video conditioned on the world state. By leveraging this
compositional world model, in combination with Vision Language Models to infer
the actions of other agents, we can use a tree search procedure to integrate
these modules and facilitate online cooperative planning. We evaluate our
methods on three challenging benchmarks with 2-4 agents. The results show our
compositional world model is effective and the framework enables the embodied
agents to cooperate efficiently with different agents across various tasks and
an arbitrary number of agents, showing the promising future of our proposed
methods. More videos can be found at https://embodied-agi.cs.umass.edu/combo/.
comment: Published at ICLR 2025. 24 pages. The first three authors contributed
equally
♻ ☆ Similarity Equivariant Graph Neural Networks for Homogenization of Metamaterials
Soft, porous mechanical metamaterials exhibit pattern transformations that
may have important applications in soft robotics, sound reduction and
biomedicine. To design these innovative materials, it is important to be able
to simulate them accurately and quickly, in order to tune their mechanical
properties. Since conventional simulations using the finite element method
entail a high computational cost, in this article we aim to develop a machine
learning-based approach that scales favorably to serve as a surrogate model. To
ensure that the model is also able to handle various microstructures, including
those not encountered during training, we include the microstructure as part of
the network input. Therefore, we introduce a graph neural network that predicts
global quantities (energy, stress, stiffness) as well as the pattern
transformations that occur (the kinematics). To make our model as accurate and
data-efficient as possible, various symmetries are incorporated into the model.
The starting point is an E(n)-equivariant graph neural network (which respects
translation, rotation and reflection) that has periodic boundary conditions
(i.e., it is in-/equivariant with respect to the choice of RVE), is scale
in-/equivariant, can simulate large deformations, and can predict scalars,
vectors as well as second and fourth order tensors (specifically energy, stress
and stiffness). The incorporation of scale equivariance makes the model
equivariant with respect to the similarities group, of which the Euclidean
group E(n) is a subgroup. We show that this network is more accurate and
data-efficient than graph neural networks with fewer symmetries. To create an
efficient graph representation of the finite element discretization, we use
only the internal geometrical hole boundaries from the finite element mesh to
achieve a better speed-up and scaling with the mesh size.
comment: 60 pages, 22 figures. Published in CMAME (Computer Methods in Applied
Mechanics and Engineering)
♻ ☆ PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm
Tianyu Chang, Xiaohao Chen, Zhichao Wei, Xuanpu Zhang, Qing-Guo Chen, Weihua Luo, Peipei Song, Xun Yang
Video Virtual Try-on aims to seamlessly transfer a reference garment onto a
target person in a video while preserving both visual fidelity and temporal
coherence. Existing methods typically rely on inpainting masks to define the
try-on area, enabling accurate garment transfer for simple scenes (e.g.,
in-shop videos). However, these mask-based approaches struggle with complex
real-world scenarios, as overly large and inconsistent masks often destroy
spatial-temporal information, leading to distorted results. Mask-free methods
alleviate this issue but face challenges in accurately determining the try-on
area, especially for videos with dynamic body movements. To address these
limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video
Virtual Try-On framework that leverages sparse point alignments to explicitly
guide garment transfer. Our key innovation is the introduction of
point-enhanced guidance, which provides flexible and reliable control over both
spatial-level garment transfer and temporal-level video coherence.
Specifically, we design a Point-Enhanced Transformer (PET) with two core
components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth
point alignments to precisely guide garment transfer, and Point-Enhanced
Temporal Attention (PTA), which leverages frame-frame point correspondences to
enhance temporal coherence and ensure smooth transitions across frames.
Extensive experiments demonstrate that our PEMF-VTO outperforms
state-of-the-art methods, generating more natural, coherent, and visually
appealing try-on videos, particularly for challenging in-the-wild scenarios.
♻ ☆ The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence
Multi-agent systems address issues of accessibility and scalability of
artificial intelligence (AI) foundation models, which are often represented by
large language models. We develop a framework - the "Society of HiveMind"
(SOHM) - that orchestrates the interaction between multiple AI foundation
models, imitating the observed behavior of animal swarms in nature by following
modern evolutionary theories. On the one hand, we find that the SOHM provides a
negligible benefit on tasks that mainly require real-world knowledge. On the
other hand, we observe a significant improvement on tasks that require intensive
logical reasoning, indicating that multi-agent systems are capable of
increasing the reasoning capabilities of the collective compared to the
individual agents. Our findings demonstrate the potential of combining a
multitude of diverse AI foundation models to form an artificial swarm
intelligence capable of self-improvement through interactions with a given
environment.
comment: 11 pages (excl. appendix)
♻ ☆ Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention
Semi-supervised learning offers an appealing solution for remote sensing (RS)
image segmentation to relieve the burden of labor-intensive pixel-level
labeling. However, RS images pose unique challenges, including rich multi-scale
features and high inter-class similarity. To address these problems, this paper
proposes a novel semi-supervised Multi-Scale Uncertainty and
Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation
tasks. Specifically, MUCA constrains the consistency among feature maps at
different layers of the network by introducing a multi-scale uncertainty
consistency regularization. It improves the multi-scale learning capability of
semi-supervised algorithms on unlabeled data. Additionally, MUCA utilizes a
Cross-Teacher-Student attention mechanism that guides the student network to
construct more discriminative feature representations
through complementary features from the teacher network. This design
effectively integrates weak and strong augmentations (WA and SA) to further
boost segmentation performance. To verify the effectiveness of our model, we
conduct extensive experiments on ISPRS-Potsdam and LoveDA datasets. The
experimental results show the superiority of our method over state-of-the-art
semi-supervised methods. Notably, our model excels in distinguishing highly
similar objects, showcasing its potential for advancing semi-supervised RS
image segmentation tasks.
♻ ☆ Networked Communication for Decentralised Agents in Mean-Field Games
We introduce networked communication to the mean-field game framework, in
particular to oracle-free settings where $N$ decentralised agents learn along a
single, non-episodic run of the empirical system. We prove that our
architecture has sample guarantees bounded between those of the centralised-
and independent-learning cases. We provide the order of the difference in these
bounds in terms of network structure and number of communication rounds, and
also contribute a policy-update stability guarantee. We discuss how the sample
guarantees of the three theoretical algorithms do not actually result in
practical convergence. We therefore show that in practical settings where the
theoretical parameters are not observed (leading to poor estimation of the
Q-function), our communication scheme considerably accelerates learning over
the independent case, often performing similarly to a centralised learner while
removing the restrictive assumption of the latter. We contribute further
practical enhancements to all three theoretical algorithms, allowing us to
present their first empirical demonstrations. Our experiments confirm that we
can remove several of the theoretical assumptions of the algorithms, and
display the empirical convergence benefits brought by our new networked
communication. We additionally show that our networked approach has significant
advantages over both alternatives in terms of robustness to update failures and
to changes in population size.
♻ ☆ Exploring a Multimodal Fusion-based Deep Learning Network for Detecting Facial Palsy IJCAI 2024
Algorithmic detection of facial palsy offers the potential to improve current
practices, which usually involve labor-intensive and subjective assessment by
clinicians. In this paper, we present a multimodal fusion-based deep learning
model that utilizes unstructured data (i.e. an image frame with facial line
segments) and structured data (i.e. features of facial expressions) to detect
facial palsy. We then conduct a study to analyze the effect of different
data modalities and the benefits of a multimodal fusion-based approach using
videos of 21 facial palsy patients. Our experimental results show that among
various data modalities (i.e. unstructured data - RGB images and images of
facial line segments and structured data - coordinates of facial landmarks and
features of facial expressions), the feed-forward neural network using features
of facial expression achieved the highest precision of 76.22 while the
ResNet-based model using images of facial line segments achieved the highest
recall of 83.47. When we leveraged both images of facial line segments and
features of facial expressions, our multimodal fusion-based deep learning model
slightly improved the precision score to 77.05 at the expense of a decrease in
the recall score.
comment: IJCAI 2024 4th AI for Ageless Aging Workshop (AIAA)
♻ ☆ Towards Generalizable Scene Change Detection CVPR 2025
While current state-of-the-art Scene Change Detection (SCD) approaches
achieve impressive results in well-trained research data, they become
unreliable under unseen environments and different temporal conditions;
in-domain performance drops from 77.6% to 8.0% in a previously unseen
environment and to 4.6% under a different temporal condition -- calling for
generalizable SCD and benchmark. In this work, we propose the Generalizable
Scene Change Detection Framework (GeSCF), which addresses unseen domain
performance and temporal consistency -- to meet the growing demand for anything
SCD. Our method leverages the pre-trained Segment Anything Model (SAM) in a
zero-shot manner. For this, we design Initial Pseudo-mask Generation and
Geometric-Semantic Mask Matching -- seamlessly turning user-guided prompt and
single-image based segmentation into scene change detection for a pair of
inputs without guidance. Furthermore, we define the Generalizable Scene Change
Detection (GeSCD) benchmark along with novel metrics and an evaluation protocol
to facilitate SCD research in generalizability. In the process, we introduce
the ChangeVPR dataset, a collection of challenging image pairs with diverse
environmental scenarios -- including urban, suburban, and rural settings.
Extensive experiments across various datasets demonstrate that GeSCF achieves
an average performance gain of 19.2% on existing SCD datasets and 30.0% on the
ChangeVPR dataset, nearly doubling the prior art performance. We believe our
work can lay a solid foundation for robust and generalizable SCD research.
comment: Camera-ready version. Accepted to CVPR 2025
♻ ☆ ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models
Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao
Large language models have made remarkable progress in the field of molecular
science, particularly in understanding and generating functional small
molecules. This success is largely attributed to the effectiveness of molecular
tokenization strategies. In protein science, the amino acid sequence serves as
the sole tokenizer for LLMs. However, many fundamental challenges in protein
science are inherently structure-dependent. The absence of structure-aware
tokens significantly limits the capabilities of LLMs for comprehensive
biomolecular comprehension and multimodal generation. To address these
challenges, we introduce a novel framework, ProtTeX, which tokenizes the
protein sequences, structures, and textual information into a unified discrete
space. This innovative approach enables joint training of the LLM exclusively
through the Next-Token Prediction paradigm, facilitating multimodal protein
reasoning and generation. ProtTeX enables general LLMs to perceive and process
protein structures through sequential text input, leverage structural
information as intermediate reasoning components, and generate or manipulate
structures via sequential text output. Experiments demonstrate that our model
achieves significant improvements in protein function prediction, outperforming
the state-of-the-art domain expert model with a twofold increase in accuracy.
Our framework enables high-quality conformational generation and customizable
protein design. For the first time, we demonstrate that by adopting the
standard training and inference pipelines from the LLM domain, ProtTeX empowers
decoder-only LLMs to effectively address a diverse spectrum of protein-related
tasks.
comment: 26 pages, 9 figures
♻ ☆ NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
Yashan Wang, Shangda Wu, Jianhuai Hu, Xingjian Du, Yueqi Peng, Yongxin Huang, Shuai Fan, Xiaobing Li, Feng Yu, Maosong Sun
We introduce NotaGen, a symbolic music generation model aiming to explore the
potential of producing high-quality classical sheet music. Inspired by the
success of Large Language Models (LLMs), NotaGen adopts pre-training,
fine-tuning, and reinforcement learning paradigms (henceforth referred to as
the LLM training paradigms). It is pre-trained on 1.6M pieces of music in ABC
notation, and then fine-tuned on approximately 9K high-quality classical
compositions conditioned on "period-composer-instrumentation" prompts. For
reinforcement learning, we propose the CLaMP-DPO method, which further enhances
generation quality and controllability without requiring human annotations or
predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in
symbolic music generation models with different architectures and encoding
schemes. Furthermore, subjective A/B tests show that NotaGen outperforms
baseline models against human compositions, greatly advancing musical
aesthetics in symbolic music generation.
♻ ☆ Knowledge-data fusion dominated vehicle platoon dynamics modeling and analysis: A physics-encoded deep learning approach
Recently, artificial intelligence (AI)-enabled nonlinear vehicle platoon
dynamics modeling has come to play a crucial role in predicting and optimizing the
interactions between vehicles. Existing efforts lack the extraction and capture
of vehicle behavior interaction features at the platoon scale. More
importantly, maintaining high modeling accuracy without losing physical
analyzability remains to be solved. To this end, this paper proposes a novel
physics-encoded deep learning network, named PeMTFLN, to model the nonlinear
vehicle platoon dynamics. Specifically, an analyzable parameters encoded
computational graph (APeCG) is designed to guide the platoon to respond to the
driving behavior of the lead vehicle while ensuring local stability. Besides, a
multi-scale trajectory feature learning network (MTFLN) is constructed to
capture platoon following patterns and infer the physical parameters required
for APeCG from trajectory data. The human-driven vehicle trajectory datasets
(HIGHSIM) were used to train the proposed PeMTFLN. Trajectory prediction
experiments show that PeMTFLN outperforms the baseline models in terms of
predictive accuracy in speed and gap. The stability analysis shows that the
physical parameters in APeCG are able to reproduce platoon stability under
real-world conditions. In simulation experiments, PeMTFLN achieves low
inference error in platoon trajectory generation. Moreover, PeMTFLN also
accurately reproduces ground-truth safety statistics. The code of the proposed
PeMTFLN is open source.
♻ ☆ PAD: Personalized Alignment of LLMs at Decoding-Time ICLR 2025
Aligning with personalized preferences, which vary significantly across
cultural, educational, and political differences, poses a significant challenge
due to the computational costs and data demands of traditional alignment
methods. In response, this paper presents Personalized Alignment at
Decoding-time (PAD), a novel framework designed to align LLM outputs with
diverse personalized preferences during the inference phase, eliminating the
need for additional training. By introducing a unique personalized reward
modeling strategy, this framework decouples the text generation process from
personalized preferences, facilitating the generation of generalizable
token-level personalized rewards. The PAD algorithm leverages these rewards to
guide the decoding process, dynamically tailoring the base model's predictions
to personalized preferences. Extensive experimental results demonstrate that
PAD not only outperforms existing training-based alignment methods in terms of
aligning with diverse preferences but also shows significant generalizability
to preferences unseen during training and scalability across different base
models. This work advances the capability of LLMs to meet user needs in
real-time applications, presenting a substantial step forward in personalized
LLM alignment.
comment: ICLR 2025
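The idea of steering decoding with token-level rewards can be illustrated with a minimal sketch. This is a generic reward-guided greedy step, not the PAD algorithm itself: the function name `guided_next_token`, the weight `beta`, and the additive combination of log-probabilities and rewards are all assumptions made for illustration.

```python
import math


def guided_next_token(base_logprobs, rewards, beta=1.0):
    """Pick the next token by adding a weighted reward to each base log-probability.

    base_logprobs: dict mapping token -> log-probability from the base model.
    rewards: dict mapping token -> personalized reward (hypothetical values);
             tokens missing from this dict get a reward of 0.
    beta: assumed trade-off weight between the base model and the reward.
    """
    scores = {
        tok: lp + beta * rewards.get(tok, 0.0)
        for tok, lp in base_logprobs.items()
    }
    return max(scores, key=scores.get)


# Toy vocabulary of two tokens with made-up probabilities and rewards.
base = {"formal": math.log(0.6), "casual": math.log(0.4)}
prefs = {"casual": 1.0}
unguided = guided_next_token(base, prefs, beta=0.0)
guided = guided_next_token(base, prefs, beta=1.0)
```

With `beta=0.0` the choice follows the base model alone; raising `beta` lets the reward shift the choice without any change to the base model's weights.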
♻ ☆ Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation
Recent algorithms allow decentralised agents, possibly connected via a
communication network, to learn equilibria in Mean-Field Games from a
non-episodic run of the empirical system. However, these algorithms are for
tabular settings: this computationally limits the size of agents' observation
space, meaning the algorithms cannot handle anything but small state spaces,
nor generalise beyond policies depending only on the agent's local state to
so-called 'population-dependent' policies. We address this limitation by
introducing function approximation to the existing setting, drawing on the
Munchausen Online Mirror Descent method that has previously been employed only
in finite-horizon, episodic, centralised settings. While this permits us to
include the mean field in the observation for players' policies, it is
unrealistic to assume decentralised agents have access to this global
information: we therefore also provide new algorithms allowing agents to
locally estimate the global empirical distribution, and to improve this
estimate via inter-agent communication. We show theoretically that exchanging
policy information helps networked agents outperform both independent and even
centralised agents in function-approximation settings. Our experiments
demonstrate this happening empirically, by an even greater margin than in
tabular settings, and show that the communication network allows decentralised
agents to estimate the mean field for population-dependent policies.
♻ ☆ Adaptive Split Learning over Energy-Constrained Wireless Edge Networks
Split learning (SL) is a promising approach for training artificial
intelligence (AI) models, in which devices collaborate with a server to train
an AI model in a distributed manner, based on the same fixed split point.
However, due to device heterogeneity and variation in channel conditions,
this approach is not optimal in terms of training delay and energy consumption. In this
paper, we design an adaptive split learning (ASL) scheme which can dynamically
select split points for devices and allocate computing resources for the server
in wireless edge networks. We formulate an optimization problem to minimize the
average training latency subject to a long-term energy consumption constraint.
The difficulties in solving this problem are the lack of future information and
its mixed-integer programming (MIP) structure. To solve it, we propose an online algorithm
leveraging Lyapunov theory, named OPEN, which decomposes it into a new MIP
problem that requires only current information. Then, a two-layer optimization
method is proposed to solve the MIP problem. Extensive simulation results
demonstrate that the ASL scheme can reduce the average training delay and
energy consumption by 53.7% and 22.1%, respectively, as compared to the
existing SL schemes.
comment: 6 pages, 5 figures, 20 conferences
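How a Lyapunov approach turns a long-term energy constraint into per-slot decisions can be sketched with the standard drift-plus-penalty pattern. This is a generic illustration, not the OPEN algorithm: the names `choose_option` and `update_queue`, the weight `v`, and the toy (latency, energy) pairs are assumptions.

```python
def choose_option(options, queue, v):
    """Pick the (latency, energy) option minimizing v * latency + queue * energy.

    queue is a virtual energy-deficit queue; a larger queue makes energy
    more expensive in the current slot. v weights latency against energy.
    """
    return min(options, key=lambda o: v * o[0] + queue * o[1])


def update_queue(queue, energy_used, energy_budget):
    """Grow the virtual queue when a slot overspends the per-slot energy budget."""
    return max(queue + energy_used - energy_budget, 0.0)


# Toy split-point options as (latency, energy) pairs; values are made up.
options = [(1.0, 5.0), (3.0, 1.0)]
first = choose_option(options, queue=0.0, v=1.0)
queue_after = update_queue(0.0, first[1], energy_budget=2.0)
second = choose_option(options, queue=queue_after, v=1.0)
```

Each decision uses only the current queue value and the current options, which is why no future information is needed; the queue carries the history of past overspending.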
♻ ☆ KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs
Large language models (LLMs) have demonstrated remarkable capabilities in
various complex tasks, yet they still suffer from hallucinations. Introducing
external knowledge, such as knowledge graphs, can enhance the LLMs' ability to
provide factual answers. LLMs have the ability to interactively explore
knowledge graphs. However, most approaches have been affected by insufficient
internal knowledge excavation in LLMs, limited generation of trustworthy
knowledge reasoning paths, and a vague integration between internal and
external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large
model framework driven by the collaboration of internal and external knowledge.
It relies on the internal knowledge of the LLM to guide the exploration of
interpretable directed subgraphs in external knowledge graphs, better
integrating the two knowledge sources for more accurate reasoning. Extensive
experiments on multiple real-world datasets confirm the superiority of
KnowPath.
♻ ☆ Diabetica: Adapting Large Language Model to Enhance Multiple Medical Tasks in Diabetes Care and Management ICLR 2025
Lai Wei, Zhen Ying, Muyang He, Yutong Chen, Qian Yang, Yanzhe Hong, Jiaping Lu, Kaipeng Zheng, Shaoting Zhang, Xiaoying Li, Weiran Huang, Ying Chen
Diabetes is a chronic disease with a significant global health burden,
requiring multi-stakeholder collaboration for optimal management. Large
language models (LLMs) have shown promise in various healthcare scenarios, but
their effectiveness across diverse diabetes tasks remains unproven. Our study
introduced a framework to train and validate diabetes-specific LLMs. We first
developed a comprehensive data processing pipeline that includes data
collection, filtering, augmentation and refinement. This created a
high-quality, diabetes-specific dataset and evaluation benchmarks from scratch.
Fine-tuned on the collected training dataset, our diabetes-specific LLM family
demonstrated state-of-the-art proficiency in processing various diabetes tasks
compared to other LLMs. Furthermore, clinical studies revealed the potential
applications of our models in diabetes care, including providing personalized
healthcare, assisting medical education, and streamlining clinical tasks.
Generally, our introduced framework helps develop diabetes-specific LLMs and
highlights their potential to enhance clinical practice and provide
personalized, data-driven support for diabetes management across different end
users. Our code, benchmarks and models are available at
https://github.com/waltonfuture/Diabetica.
comment: Accepted by ICLR 2025 SCI-FM workshop
♻ ☆ Deep Reinforcement Learning for Dynamic Resource Allocation in Wireless Networks
This report investigates the application of deep reinforcement learning (DRL)
algorithms for dynamic resource allocation in wireless communication systems.
An environment that includes a base station, multiple antennas, and user
equipment is created. Using the RLlib library, various DRL algorithms such as
Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) are then applied.
These algorithms are compared based on their ability to optimize resource
allocation, focusing on the impact of different learning rates and scheduling
policies. The findings demonstrate that the choice of algorithm and learning
rate significantly influences system performance, with DRL providing more
efficient resource allocation compared to traditional methods.
comment: Upon further review, we found inconsistencies in our analysis and
decided to conduct additional research before resubmitting a revised version
♻ ☆ Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling
Deep neural networks (DNNs) often exhibit biases toward certain categories
during object recognition, even under balanced training data conditions. The
intrinsic mechanisms underlying these biases remain unclear. Inspired by the
human visual system, which decouples object manifolds through hierarchical
processing to achieve object recognition, we propose a geometric analysis
framework linking the geometric complexity of class-specific perceptual
manifolds in DNNs to model bias. Our findings reveal that differences in
geometric complexity can lead to varying recognition capabilities across
categories, introducing biases. To support this analysis, we present the
Perceptual-Manifold-Geometry library, designed for calculating the geometric
properties of perceptual manifolds.
♻ ☆ A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training
The stochastic gradient descent (SGD) algorithm has achieved remarkable
success in training deep learning models. However, it has several limitations,
including susceptibility to vanishing gradients, sensitivity to input data, and
a lack of robust theoretical guarantees. In recent years, alternating
minimization (AM) methods have emerged as a promising alternative for model
training by employing gradient-free approaches to iteratively update model
parameters. Despite their potential, these methods often exhibit slow
convergence rates. To address this challenge, we propose a novel
Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for
neural network training. The TIAM approach incorporates a triple-inertial
acceleration strategy with a specialized approximation method, facilitating
targeted acceleration of different terms in each sub-problem optimization. This
integration improves the efficiency of convergence, achieving superior
performance with fewer iterations. Additionally, we provide a convergence
analysis of the TIAM algorithm, including its global convergence properties and
convergence rate. Extensive experiments validate the effectiveness of the TIAM
method, showing significant improvements in generalization capability and
computational efficiency compared to existing approaches, particularly when
applied to the rectified linear unit (ReLU) and its variants.
♻ ☆ Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising
Many studies have concentrated on constructing supervised models utilizing
paired datasets for image denoising, which proves to be expensive and
time-consuming. Current self-supervised and unsupervised approaches typically
rely on blind-spot networks or sub-image pairs sampling, resulting in pixel
information loss and destruction of detailed structural information, thereby
significantly constraining the efficacy of such methods. In this paper, we
introduce Prompt-SID, a prompt-learning-based single image denoising framework
that emphasizes preserving structural details. This approach is trained in a
self-supervised manner using downsampled image pairs. It captures
original-scale image information through structural encoding and integrates
this prompt into the denoiser. To achieve this, we propose a structural
representation generation model based on the latent diffusion process and
design a structural attention module within the transformer-based denoiser
architecture to decode the prompt. Additionally, we introduce a scale replay
training mechanism, which effectively mitigates the scale gap between images of
different resolutions. We conduct comprehensive experiments on synthetic,
real-world, and fluorescence imaging datasets, showcasing the remarkable
effectiveness of Prompt-SID. Our code will be released at
https://github.com/huaqlili/Prompt-SID.
♻ ☆ Continuous K-space Recovery Network with Image Guidance for Fast MRI Reconstruction
Magnetic resonance imaging (MRI) is a crucial tool for clinical diagnosis
but faces the challenge of long scanning times. To reduce the acquisition
time, fast MRI reconstruction aims to restore high-quality images from the
undersampled k-space. Existing methods typically train deep learning models to
map the undersampled data to artifact-free MRI images. However, these studies
often overlook the unique properties of k-space and directly apply general
networks designed for image processing to k-space recovery, leaving the precise
learning of k-space largely underexplored. In this work, we propose a
continuous k-space recovery network from a new perspective of implicit neural
representation with image domain guidance, which boosts the performance of MRI
reconstruction. Specifically, (1) an implicit neural representation based
encoder-decoder structure is customized to continuously query unsampled
k-values; (2) an image guidance module is designed to mine the semantic
information from the low-quality MRI images to further guide the k-space
recovery; and (3) a multi-stage training strategy is proposed to recover dense
k-space progressively. Extensive experiments conducted on CC359, fastMRI, and
IXI datasets demonstrate the effectiveness of our method and its superiority
over other competitors.
♻ ☆ Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs
This work adapts and studies the gradient-based Membership Inference Test
(gMINT) to the classification of text based on LLMs. MINT is a general approach
intended to determine if given data was used for training machine learning
models, and this work focuses on its application to the domain of Natural
Language Processing. Using gradient-based analysis, the MINT model identifies
whether particular data samples were included during the language model
training phase, addressing growing concerns about data privacy in machine
learning. The method was evaluated on seven Transformer-based models and six
datasets comprising over 2.5 million sentences, focusing on text classification
tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores
between 85% and 99%, depending on data size and model architecture. These
findings highlight MINT's potential as a scalable and reliable tool for auditing
machine learning models, ensuring transparency, safeguarding sensitive data,
and fostering ethical compliance in the deployment of AI/NLP technologies.
♻ ☆ Determination of galaxy photometric redshifts using Conditional Generative Adversarial Networks (CGANs)
Accurate and reliable photometric redshift determination is one of the key
aspects for wide-field photometric surveys. Determination of photometric
redshift for galaxies has traditionally been solved by use of machine-learning
and artificial intelligence techniques trained on a calibration sample of
galaxies, where both photometry and spectrometry are available. In this paper,
we present a new algorithmic approach for determining photometric redshifts of
galaxies using Conditional Generative Adversarial Networks (CGANs). The
proposed implementation is able to determine both point-estimation and
probability-density estimations for photometric redshifts. The methodology is
tested with data from the Dark Energy Survey (DES) Y1 and compared with other
existing algorithms such as a Mixture Density Network (MDN). Although the results
obtained show a superiority of MDN, CGAN quality metrics are close to the MDN
results, opening the door to the use of CGANs for photometric redshift
estimation.
♻ ☆ InstructPipe: Generating Visual Blocks Pipelines with Human Instructions and LLMs
Zhongyi Zhou, Jing Jin, Vrushank Phadnis, Xiuxiu Yuan, Jun Jiang, Xun Qian, Kristen Wright, Mark Sherwood, Jason Mayes, Jingtao Zhou, Yiyi Huang, Zheng Xu, Yinda Zhang, Johnny Lee, Alex Olwal, David Kim, Ram Iyengar, Na Li, Ruofei Du
Visual programming has the potential of providing novice programmers with a
low-code experience to build customized processing pipelines. Existing systems
typically require users to build pipelines from scratch, implying that novice
users are expected to set up and link appropriate nodes from a blank workspace.
In this paper, we introduce InstructPipe, an AI assistant for prototyping
machine learning (ML) pipelines with text instructions. We contribute two large
language model (LLM) modules and a code interpreter as part of our framework.
The LLM modules generate pseudocode for a target pipeline, and the interpreter
renders the pipeline in the node-graph editor for further human-AI
collaboration. Both technical and user evaluations (N=16) show that
InstructPipe empowers users to streamline their ML pipeline workflow, reduce
their learning curve, and leverage open-ended commands to spark innovative
ideas.
comment: CHI 2025
♻ ☆ Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators
Compute-in-memory (CIM) is an efficient method for implementing deep neural
networks (DNNs) but suffers from substantial overhead from analog-to-digital
converters (ADCs), especially as ADC precision increases. Low-precision ADCs
can reduce this overhead but introduce partial-sum quantization errors
degrading accuracy. Additionally, low-bit weight constraints, imposed by cell
limitations and the need for multiple cells for higher-bit weights, present
further challenges. While fine-grained partial-sum quantization has been
studied to lower ADC resolution effectively, weight granularity, which limits
overall partial-sum quantized accuracy, remains underexplored. This work
addresses these challenges by aligning weight and partial-sum quantization
granularities at the column-wise level. Our method improves accuracy while
maintaining dequantization overhead, simplifies training by removing two-stage
processes, and ensures robustness to memory cell variations via independent
column-wise scale factors. We also propose an open-source CIM-oriented
convolution framework to handle fine-grained weights and partial-sums
efficiently, incorporating a novel tiling method and group convolution.
Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18
(ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively,
compared to the best-performing related works. Additionally, variation analysis
reveals the robustness of our method against memory cell variations. These
findings highlight the effectiveness of our quantization scheme in enhancing
accuracy and robustness while maintaining hardware efficiency in CIM-based DNN
implementations. Our code is available at
https://github.com/jiyoonkm/ColumnQuant.
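The role of independent column-wise scale factors can be illustrated with a small sketch. This is a generic symmetric quantizer with one scale per weight column, not the paper's scheme or its CIM framework: the function names, the 4-bit default, and the max-abs scale rule are assumptions.

```python
def quantize_columnwise(weights, bits=4):
    """Quantize a 2D weight matrix (list of rows) with one scale per column.

    Each column's scale is its largest absolute value divided by the
    largest representable signed integer, so columns with very different
    ranges do not share a single coarse scale.
    """
    qmax = 2 ** (bits - 1) - 1
    n_cols = len(weights[0])
    scales = []
    for c in range(n_cols):
        max_abs = max(abs(row[c]) for row in weights)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    q = [[round(row[c] / scales[c]) for c in range(n_cols)] for row in weights]
    return q, scales


def dequantize_columnwise(q, scales):
    """Map integer codes back to real values using the per-column scales."""
    return [[v * scales[c] for c, v in enumerate(row)] for row in q]


# Two columns with very different ranges; values are made up.
w = [[7.0, -0.875], [2.0, 0.375]]
q, scales = quantize_columnwise(w, bits=4)
w_hat = dequantize_columnwise(q, scales)
```

With a single shared scale, the second column's small values would collapse to only a few integer levels; the per-column scales keep both columns at full 4-bit resolution in this toy case.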
♻ ☆ The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government
As artificial intelligence transforms public sector operations, governments
struggle to integrate technological innovations into coherent systems for
effective service delivery. This paper introduces the Algorithmic State
Architecture (ASA), a novel four-layer framework conceptualising how Digital
Public Infrastructure, Data-for-Policy, Algorithmic Government/Governance, and
GovTech interact as an integrated system in AI-enabled states. Unlike
approaches that treat these as parallel developments, ASA positions them as
interdependent layers with specific enabling relationships and feedback
mechanisms. Through comparative analysis of implementations in Estonia,
Singapore, India, and the UK, we demonstrate how foundational digital
infrastructure enables systematic data collection, which powers algorithmic
decision-making processes, ultimately manifesting in user-facing services. Our
analysis reveals that successful implementations require balanced development
across all layers, with particular attention to integration mechanisms between
them. The framework contributes to both theory and practice by bridging
previously disconnected domains of digital government research, identifying
critical dependencies that influence implementation success, and providing a
structured approach for analysing the maturity and development pathways of
AI-enabled government systems.
comment: Main text: 25 pages, with references: 35 pages, 2 figures
♻ ☆ FlashRNN: I/O-Aware Optimization of Traditional RNNs on modern hardware
While Transformers and other sequence-parallelizable neural network
architectures seem like the current state of the art in sequence modeling, they
specifically lack state-tracking capabilities. These are important for
time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs,
as well as modern variants like sLSTM, do have these capabilities at the cost of
strictly sequential processing. While this is often seen as a strong
limitation, we show how fast these networks can get with our
hardware-optimization FlashRNN in Triton and CUDA, optimizing kernels to the
register level on modern GPUs. We extend traditional RNNs with a
parallelization variant that processes multiple RNNs of smaller hidden state in
parallel, similar to the head-wise processing in Transformers. To enable
flexibility on different GPU variants, we introduce a new optimization
framework for hardware-internal cache sizes, memory and compute handling. It
models the hardware in a setting using polyhedral-like constraints, including
the notion of divisibility. This speeds up the solution process in our
ConstrINT library for general integer constraint satisfaction problems (integer
CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla
PyTorch implementation and allow 40x larger hidden sizes compared to our Triton
implementation. Our open-source kernels and the optimization library are
released here to boost research in the direction of state-tracking enabled RNNs
and sequence modeling: https://github.com/NX-AI/flashrnn
♻ ☆ TH-Bench: Evaluating Evading Attacks via Humanizing AI Text on Machine-Generated Text Detectors
As Large Language Models (LLMs) advance, Machine-Generated Texts (MGTs) have
become increasingly fluent, high-quality, and informative. Existing wide-range
MGT detectors are designed to identify MGTs to prevent the spread of plagiarism
and misinformation. However, adversaries attempt to humanize MGTs to evade
detection (named evading attacks), which requires only minor modifications to
bypass MGT detectors. Unfortunately, existing attacks generally lack a unified
and comprehensive evaluation framework, as they are assessed using different
experimental settings, model architectures, and datasets. To fill this gap, we
introduce the Text-Humanization Benchmark (TH-Bench), the first comprehensive
benchmark to evaluate evading attacks against MGT detectors. TH-Bench evaluates
attacks across three key dimensions: evading effectiveness, text quality, and
computational overhead. Our extensive experiments evaluate 6 state-of-the-art
attacks against 13 MGT detectors across 6 datasets, spanning 19 domains and
generated by 11 widely used LLMs. Our findings reveal that no single evading
attack excels across all three dimensions. Through in-depth analysis, we
highlight the strengths and limitations of different attacks. More importantly,
we identify a trade-off among three dimensions and propose two optimization
insights. Through preliminary experiments, we validate their correctness and
effectiveness, offering potential directions for future research.
♻ ☆ Hidden in the Noise: Two-Stage Robust Watermarking for Images
As the quality of image generators continues to improve, deepfakes become a
topic of considerable societal debate. Image watermarking allows responsible
model owners to detect and label their AI-generated content, which can mitigate
the harm. Yet, current state-of-the-art methods in image watermarking remain
vulnerable to forgery and removal attacks. This vulnerability occurs in part
because watermarks distort the distribution of generated images,
unintentionally revealing information about the watermarking techniques.
In this work, we first demonstrate a distortion-free watermarking method for
images, based on a diffusion model's initial noise. However, detecting the
watermark requires comparing the initial noise reconstructed for an image to
all previously used initial noises. To mitigate these issues, we propose a
two-stage watermarking framework for efficient detection. During generation, we
augment the initial noise with generated Fourier patterns to embed information
about the group of initial noises we used. For detection, we (i) retrieve the
relevant group of noises, and (ii) search within the given group for an initial
noise that might match our image. This watermarking approach achieves
state-of-the-art robustness to forgery and removal against a large battery of
attacks.
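The two detection steps described above, group retrieval followed by a search within the group, can be sketched as follows. This is an illustration only: it assumes the group identifier has already been decoded from the image and the initial noise has already been reconstructed, and the names `cosine` and `detect`, the cosine-similarity match, and the threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def detect(recovered_noise, group_id, groups, threshold=0.9):
    """Stage (i): fetch the noises in the decoded group.
    Stage (ii): return the best-matching noise if it clears the threshold."""
    candidates = groups.get(group_id, [])
    best = max(candidates, key=lambda n: cosine(recovered_noise, n), default=None)
    if best is not None and cosine(recovered_noise, best) >= threshold:
        return best
    return None


# Toy 2D "noises" grouped by a made-up group id.
groups = {0: [(1.0, 0.0), (0.0, 1.0)], 1: [(-1.0, 0.0)]}
match = detect((0.9, 0.1), 0, groups)
no_match = detect((0.9, 0.1), 1, groups)
```

Restricting the search to one group means the comparison cost scales with the group size rather than with the total number of initial noises ever used.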
♻ ☆ Long-horizon Embodied Planning with Implicit Logical Inference and Hallucination Mitigation
Long-horizon embodied planning underpins embodied AI. To accomplish
long-horizon tasks, one of the most feasible ways is to decompose abstract
instructions into a sequence of actionable steps. Foundation models still face
logical errors and hallucinations in long-horizon planning, unless provided
with highly relevant examples to the tasks. However, providing highly relevant
examples for any random task is impractical. Therefore, we present ReLEP, a
novel framework for Real-time Long-horizon Embodied Planning. ReLEP can
complete a wide range of long-horizon tasks without in-context examples by
learning implicit logical inference through fine-tuning. The fine-tuned large
vision-language model formulates plans as sequences of skill functions. These
functions are selected from a carefully designed skill library. ReLEP is also
equipped with a Memory module for plan and status recall, and a Robot
Configuration module for versatility across robot types. In addition, we
propose a data generation pipeline to tackle dataset scarcity. When
constructing the dataset, we considered the implicit logical relationships,
enabling the model to learn these relationships and dispel
hallucinations. Through comprehensive evaluations across various long-horizon
tasks, ReLEP demonstrates high success rates and compliance to execution even
on unseen tasks and outperforms state-of-the-art baseline methods.
♻ ☆ MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model ICLR 2025
Generative models aim to simulate realistic effects of various actions across
different contexts, from text generation to visual effects. Despite significant
efforts to build real-world simulators, the application of generative models to
virtual worlds, like financial markets, remains under-explored. In financial
markets, generative models can simulate complex market effects of participants
with various behaviors, enabling interaction under different market conditions,
and training strategies without financial risk. This simulation relies on the
most fine-grained structured data in financial markets, such as orders, thus
building the most fine-grained realistic simulation. We propose the Large Market Model (LMM), an order-level
generative foundation model, for financial market simulation, akin to language
modeling in the digital world. Our financial Market Simulation engine (MarS),
powered by LMM, addresses the domain-specific need for realistic, interactive
and controllable order generation. Key observations include LMM's strong
scalability across data size and model complexity, and MarS's robust and
practicable realism in controlled generation with market impact. We showcase
MarS as a forecast tool, detection system, analysis platform, and agent
training environment, thus demonstrating MarS's "paradigm shift" potential for
a variety of financial applications. We release the code of MarS at
https://github.com/microsoft/MarS/.
comment: 35 pages, 26 figures, ICLR 2025
♻ ☆ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL)
remains a long-standing challenge due to the complexities involved in user
question understanding, database schema comprehension, and SQL generation.
Traditional text-to-SQL systems, which combine human engineering and deep
neural networks, have made significant progress. Subsequently, pre-trained
language models (PLMs) have been developed for text-to-SQL tasks, achieving
promising results. However, as modern databases and user questions grow more
complex, PLMs with a limited parameter size often produce incorrect SQL. This
necessitates more sophisticated and tailored optimization methods, which
restricts the application of PLM-based systems. Recently, large language models
(LLMs) have shown significant capabilities in natural language understanding as
model scale increases. Thus, integrating LLM-based solutions can bring unique
opportunities, improvements, and solutions to text-to-SQL research. In this
survey, we provide a comprehensive review of existing LLM-based text-to-SQL
studies. Specifically, we offer a brief overview of the technical challenges
and evolutionary process of text-to-SQL. Next, we introduce the datasets and
metrics designed to evaluate text-to-SQL systems. Subsequently, we present a
systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we
summarize our findings, discuss the remaining challenges in this field, and
suggest expectations for future research directions.
♻ ☆ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM EuroSys'2025
Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu
Large Multimodal Models (LMMs) have shown significant progress in various
complex vision tasks with the solid linguistic and reasoning capacity inherited
from large language models (LLMs). Low-rank adaptation (LoRA) offers a
promising method to integrate external knowledge into LMMs, compensating for
their limitations on domain-specific tasks. However, the existing LoRA model
serving is excessively computationally expensive and causes extremely high
latency. In this paper, we present an end-to-end solution that empowers diverse
vision tasks and enriches vision applications with LoRA LMMs. Our system,
VaLoRA, enables accurate and efficient vision tasks by 1) an accuracy-aware
LoRA adapter generation approach that generates LoRA adapters rich in
domain-specific knowledge to meet application-specific accuracy requirements,
2) an adaptive-tiling LoRA adapters batching operator that efficiently computes
concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter
orchestration mechanism that manages application requests and LoRA adapters to
achieve the lowest average response latency. We prototype VaLoRA on five
popular vision tasks on three LMMs. Experimental results reveal that VaLoRA
improves accuracy by 24-62% compared to the original LMMs and reduces latency
by 20-89% compared to the state-of-the-art LoRA model serving systems.
comment: EuroSys'2025
♻ ☆ HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning ICLR
Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
Controllable generation through Stable Diffusion (SD) fine-tuning aims to
improve fidelity, safety, and alignment with human guidance. Existing
reinforcement learning from human feedback methods usually rely on predefined
heuristic reward functions or pretrained reward models built on large-scale
datasets, limiting their applicability to scenarios where collecting such data
is costly or difficult. To effectively and efficiently utilize human feedback,
we develop a framework, HERO, which leverages online human feedback collected
on the fly during model learning. Specifically, HERO features two key
mechanisms: (1) Feedback-Aligned Representation Learning, an online training
method that captures human feedback and provides informative learning signals
for fine-tuning, and (2) Feedback-Guided Image Generation, which involves
generating images from SD's refined initialization samples, enabling faster
convergence towards the evaluator's intent. We demonstrate that HERO is 4x more
efficient in online feedback for body part anomaly correction compared to the
best existing method. Additionally, experiments show that HERO can effectively
handle tasks like reasoning, counting, personalization, and reducing NSFW
content with only 0.5K online feedback. The code and project page are available
at https://hero-dm.github.io/.
comment: Published in International Conference on Learning Representations
(ICLR) 2025
♻ ☆ Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference
Large Language Models (LLMs) are increasingly being used to automate
programming tasks. Yet, LLMs' capabilities in reasoning about program semantics
are still inadequately studied, leaving significant potential for further
exploration. This paper introduces FormalBench, a comprehensive benchmark
designed to evaluate LLMs' reasoning abilities on program semantics,
particularly via the task of synthesizing formal program specifications to
assist verifying program correctness. This task requires both comprehensive
reasoning over all possible program executions and the generation of precise,
syntactically correct expressions that adhere to formal syntax and semantics.
Using this benchmark, we evaluated the ability of LLMs in synthesizing
consistent and complete specifications. Our findings show that LLMs perform
well with simple control flows but struggle with more complex structures,
especially loops, even with advanced prompting. Additionally, LLMs exhibit
limited robustness against semantic-preserving transformations. We also
highlight common failure patterns and design self-repair prompts, improving
success rates by 25%.
♻ ☆ Reinforcement Learning-Enhanced Procedural Generation for Dynamic Narrative-Driven AR Experiences
Procedural Content Generation (PCG) is widely used to create scalable and
diverse environments in games. However, existing methods, such as the Wave
Function Collapse (WFC) algorithm, are often limited to static scenarios and
lack the adaptability required for dynamic, narrative-driven applications,
particularly in augmented reality (AR) games. This paper presents a
reinforcement learning-enhanced WFC framework designed for mobile AR
environments. By integrating environment-specific rules and dynamic tile weight
adjustments informed by reinforcement learning (RL), the proposed method
generates maps that are both contextually coherent and responsive to gameplay
needs. Comparative evaluations and user studies demonstrate that the framework
achieves superior map quality and delivers immersive experiences, making it
well-suited for narrative-driven AR games. Additionally, the method holds
promise for broader applications in education, simulation training, and
immersive extended reality (XR) experiences, where dynamic and adaptive
environments are critical.
comment: Published in Proceedings of the 20th International Joint Conference
on Computer Vision, Imaging and Computer Graphics Theory and Applications -
GRAPP 2025
https://www.scitepress.org/PublicationsDetail.aspx?ID=LfPv9Lfiya8=&t=1
♻ ☆ TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
In the domain of complex reasoning tasks, such as mathematical reasoning,
recent advancements have proposed the use of Direct Preference Optimization
(DPO) to suppress output of dispreferred responses, thereby enhancing the
long-chain reasoning capabilities of large language models (LLMs). To this end,
these studies employed LLMs to generate preference trees via Tree-of-thoughts
(ToT) and sample the paired preference responses required by the DPO algorithm.
However, the DPO algorithm based on binary preference optimization is unable to
learn multiple responses with varying degrees of preference/dispreference that
are provided by the preference trees, resulting in incomplete preference learning.
In this work, we introduce Tree Preference Optimization (TPO), which does not
sample paired preference responses from the preference tree; instead, it
directly learns from the entire preference tree during the fine-tuning.
Specifically, TPO formulates the language model alignment as a Preference List
Ranking problem, where the policy can potentially learn more effectively from a
ranked preference list of responses given the prompt. In addition, to further
assist LLMs in identifying discriminative steps within long-chain reasoning and
increase the relative reward margin in the preference list, TPO utilizes
Adaptive Step Reward to adjust the reward values of each step in the trajectory for
performing fine-grained preference optimization. We carry out extensive
experiments on mathematical reasoning tasks to evaluate TPO. The experimental
results indicate that TPO consistently outperforms DPO across five public large
language models on four datasets.
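As an illustration of what "learning from a ranked preference list" can mean, the sketch below computes a generic Plackett-Luce (ListMLE-style) listwise loss over per-response scores. It is not the TPO objective, which the abstract does not spell out; the function name `listwise_ranking_loss` and the toy scores are assumptions made for this example.

```python
import math

def listwise_ranking_loss(scores):
    """Plackett-Luce negative log-likelihood of a ranked list.

    `scores` holds one scalar per response, ordered from most preferred
    to least preferred (hypothetical inputs, not TPO's actual rewards).
    """
    loss = 0.0
    for k in range(len(scores)):
        tail = scores[k:]
        # Numerically stable log-sum-exp over the responses not yet ranked.
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        # Negative log-probability of picking response k next.
        loss += log_z - scores[k]
    return loss

# Scores that agree with the ranking give a lower loss than reversed scores.
aligned = listwise_ranking_loss([3.0, 2.0, 1.0])
reversed_order = listwise_ranking_loss([1.0, 2.0, 3.0])
```

Unlike a pairwise loss, this objective uses every position in the list at once, which is the general property the abstract contrasts with binary DPO.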
♻ ☆ LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner ICRA 2025
Language models (LMs) possess a strong capability to comprehend natural
language, making them effective in translating human instructions into detailed
plans for simple robot tasks. Nevertheless, it remains a significant challenge
to handle long-horizon tasks, especially in subtask identification and
allocation for cooperative heterogeneous robot teams. To address this issue, we
propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel
multi-agent task planning framework that achieves state-of-the-art performance
on long-horizon tasks. LaMMA-P integrates the strengths of the LMs' reasoning
capability and the traditional heuristic search planner to achieve a high
success rate and efficiency while demonstrating strong generalization across
tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that
features household tasks with two different levels of complexity based on the
AI2-THOR environment. The experimental results demonstrate that LaMMA-P
achieves a 105% higher success rate and 36% higher efficiency than existing
LM-based multiagent planners. The experimental videos, code, datasets, and
detailed prompts used in each module can be found on the project website:
https://lamma-p.github.io.
comment: IEEE Conference on Robotics and Automation (ICRA 2025); Project
website: https://lamma-p.github.io/
♻ ☆ Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks ECCV 2024
Recent vision-language foundation models, such as CLIP, have demonstrated
superior capabilities in learning representations that can be transferable
across a diverse range of downstream tasks and domains. With the emergence of
such powerful models, it has become crucial to effectively leverage their
capabilities in tackling challenging vision tasks. On the other hand, only a
few works have focused on devising adversarial examples that transfer well to
both unknown domains and model architectures. In this paper, we propose a novel
transfer attack method called PDCL-Attack, which leverages the CLIP model to
enhance the transferability of adversarial perturbations generated by a
generative model-based attack framework. Specifically, we formulate an
effective prompt-driven feature guidance by harnessing the semantic
representation power of text, particularly from the ground-truth class labels
of input images. To the best of our knowledge, we are the first to introduce
prompt learning to enhance the transferable generative attacks. Extensive
experiments conducted across various cross-domain and cross-model settings
empirically validate our approach, demonstrating its superiority over
state-of-the-art methods.
comment: Accepted to ECCV 2024 (Oral), Project Page:
https://PDCL-Attack.github.io
♻ ☆ Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
The success of multi-modal large language models (MLLMs) has been largely
attributed to the large-scale training data. However, the training data of many
MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive
process of collecting multi-modal data further exacerbates the problem. Is it
possible to synthesize multi-modal training data automatically without
compromising diversity and quality? In this paper, we propose a new method,
Oasis, to synthesize high-quality multi-modal data with only images. Oasis
breaks through traditional methods by prompting only images to the MLLMs, thus
extending the data diversity by a large margin. Our method features a delicate
quality control method which ensures the data quality. We collected over 500k
data and conducted incremental experiments on LLaVA-NeXT. Extensive experiments
demonstrate that our method can significantly improve the performance of MLLMs.
The image-based synthesis also allows us to focus on the specific-domain
ability of MLLMs. Code and data will be publicly available.
♻ ☆ DeepInnovation AI: A Global Dataset Mapping the AI innovation from Academic Research to Industrial Patents
In the rapidly evolving field of artificial intelligence (AI), mapping
innovation patterns and understanding effective technology transfer from
research to applications are essential for economic growth. However, existing
data infrastructures suffer from fragmentation, incomplete coverage, and
insufficient evaluative capacity. Here, we present DeepInnovationAI, a
comprehensive global dataset containing three structured files.
DeepPatentAI.csv: Contains 2,356,204 patent records with 8 field-specific
attributes. DeepDiveAI.csv: Encompasses 3,511,929 academic publications with 13
metadata fields. These two datasets leverage large language models,
multilingual text analysis and dual-layer BERT classifiers to accurately
identify AI-related content, while utilizing hypergraph analysis to create
robust innovation metrics. Additionally, DeepCosineAI.csv: By applying semantic
vector proximity analysis, this file presents approximately one hundred million
calculated paper-patent similarity pairs to enhance understanding of how
theoretical advancements translate into commercial technologies.
DeepInnovationAI enables researchers, policymakers, and industry leaders to
anticipate trends and identify collaboration opportunities. With extensive
temporal and geographical scope, it supports detailed analysis of technological
development patterns and international competition dynamics, establishing a
foundation for modeling AI innovation and technology transfer processes.
comment: 32 pages and 8 figures
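A minimal sketch of a paper-patent similarity score, assuming the "semantic vector proximity analysis" is a cosine similarity between embedding vectors (the file name DeepCosineAI.csv suggests this, but the abstract does not confirm it). The vectors and the function name `cosine_similarity` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for one paper and one patent.
paper_vec = [1.0, 0.0, 1.0]
patent_vec = [1.0, 0.0, 0.0]
score = cosine_similarity(paper_vec, patent_vec)
```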
♻ ☆ Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Recent advancements in reasoning with large language models (RLLMs), such as
OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in
complex domains like mathematics and coding. A central factor in their success
lies in the application of long chain-of-thought (Long CoT) characteristics,
which enhance reasoning abilities and enable the solution of intricate
problems. However, despite these developments, a comprehensive survey on Long
CoT is still lacking, limiting our understanding of its distinctions from
traditional short chain-of-thought (Short CoT) and complicating ongoing debates
on issues like "overthinking" and "test-time scaling." This survey seeks to
fill this gap by offering a unified perspective on Long CoT. (1) We first
distinguish Long CoT from Short CoT and introduce a novel taxonomy to
categorize current reasoning paradigms. (2) Next, we explore the key
characteristics of Long CoT: deep reasoning, extensive exploration, and
feasible reflection, which enable models to handle more complex tasks and
produce more efficient, coherent outcomes compared to the shallower Short CoT.
(3) We then investigate key phenomena such as the emergence of Long CoT with
these characteristics, including overthinking, and test-time scaling, offering
insights into how these processes manifest in practice. (4) Finally, we
identify significant research gaps and highlight promising future directions,
including the integration of multi-modal reasoning, efficiency improvements,
and enhanced knowledge frameworks. By providing a structured overview, this
survey aims to inspire future research and further the development of logical
reasoning in artificial intelligence.
comment: Paper is available at https://long-cot.github.io/
♻ ☆ AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection
As object detection becomes integral to many safety-critical applications,
understanding its vulnerabilities is essential. Backdoor attacks, in
particular, pose a serious threat by implanting hidden triggers in victim
models, which adversaries can later exploit to induce malicious behaviors
during inference. However, current understanding is limited to single-target
attacks, where adversaries must define a fixed malicious behavior (target)
before training, making inference-time adaptability impossible. Given the large
output space of object detection (including object existence prediction,
bounding box estimation, and classification), the feasibility of flexible,
inference-time model control remains unexplored. This paper introduces
AnywhereDoor, a multi-target backdoor attack for object detection. Once
implanted, AnywhereDoor allows adversaries to make objects disappear, fabricate
new ones, or mislabel them, either across all object classes or specific ones,
offering an unprecedented degree of control. This flexibility is enabled by
three key innovations: (i) objective disentanglement to scale the number of
supported targets; (ii) trigger mosaicking to ensure robustness even against
region-based detectors; and (iii) strategic batching to address object-level
data imbalances that hinder manipulation. Extensive experiments demonstrate
that AnywhereDoor grants attackers a high degree of control, improving attack
success rates by 26% compared to adaptations of existing methods for such
flexible control.
comment: This work was intended as a replacement of arXiv:2411.14243 and any
subsequent updates will appear there
♻ ☆ Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM
This work presents an interpretable decision-making framework for autonomous
vehicles that integrates traffic regulations, norms, and safety guidelines
comprehensively and enables seamless adaptation to different regions. While
traditional rule-based methods struggle to incorporate the full scope of
traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on
Retrieval-Augmented Generation (RAG) to automatically retrieve relevant traffic
rules and guidelines from extensive regulation documents and relevant records
based on the ego vehicle's situation. Given the semantic complexity of the
retrieved rules, we also design a reasoning module powered by a Large Language
Model (LLM) to interpret these rules, differentiate between mandatory rules and
safety guidelines, and assess actions on legal compliance and safety.
Additionally, the reasoning is designed to be interpretable, enhancing both
transparency and reliability. The framework demonstrates robust performance on
both hypothesized and real-world cases across diverse scenarios, along with the
ability to adapt to different regions with ease.
♻ ☆ Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game
Achieving Artificial General Intelligence (AGI) requires AI agents that can
not only make strategic decisions but also engage in flexible and meaningful
communication. Inspired by Wittgenstein's language game theory in Philosophical
Investigations, we propose that language agents can learn through in-context
interaction rather than traditional multi-stage frameworks that separate
decision-making from language expression. Using Werewolf, a social deduction
game that tests language understanding, strategic interaction, and
adaptability, we develop the Multi-agent Kahneman & Tversky's Optimization
(MaKTO). MaKTO engages diverse models in extensive gameplay to generate
unpaired desirable and unacceptable responses, then employs KTO to refine the
model's decision-making process. In 9-player Werewolf games, MaKTO achieves a
61% average win rate across various models, outperforming GPT-4o and two-stage
RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably,
MaKTO also demonstrates human-like performance, winning 60% against expert
players and showing only 49% detectability in Turing-style blind tests.
comment: Preprint. Code and data will be available at
https://reneeye.github.io/MaKTO.html
♻ ☆ DA-STGCN: 4D Trajectory Prediction Based on Spatiotemporal Feature Extraction
The importance of four-dimensional (4D) trajectory prediction within air
traffic management systems is on the rise. Key operations such as conflict
detection and resolution, aircraft anomaly monitoring, and the management of
congested flight paths are increasingly reliant on this foundational
technology, underscoring the urgent demand for intelligent solutions. The
dynamics in airport terminal zones and crowded airspaces are intricate and
ever-changing; however, current methodologies do not sufficiently account for
the interactions among aircraft. To tackle these challenges, we propose
DA-STGCN, an innovative spatiotemporal graph convolutional network that
integrates a dual attention mechanism. Our model reconstructs the adjacency
matrix through a self-attention approach, enhancing the capture of node
correlations, and employs graph attention to distill spatiotemporal
characteristics, thereby generating a probabilistic distribution of predicted
trajectories. This novel adjacency matrix, reconstructed with the
self-attention mechanism, is dynamically optimized throughout the network's
training process, offering a more nuanced reflection of the inter-node
relationships compared to traditional algorithms. The performance of the model
is validated on two ADS-B datasets, one near the airport terminal area and the
other in dense airspace. Experimental results demonstrate a notable improvement
over current 4D trajectory prediction methods, achieving a 20% and 30%
reduction in the Average Displacement Error (ADE) and Final Displacement Error
(FDE), respectively. The incorporation of a Dual-Attention module has been
shown to significantly enhance the extraction of node correlations, as verified
by ablation experiments.
♻ ☆ KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis AAAI-25
Integrating Large Language Models (LLMs) in healthcare diagnosis demands
systematic frameworks that can handle complex medical scenarios while
maintaining specialized expertise. We present KG4Diagnosis, a novel
hierarchical multi-agent framework that combines LLMs with automated knowledge
graph construction, encompassing 362 common diseases across medical
specialties. Our framework mirrors real-world medical systems through a
two-tier architecture: a general practitioner (GP) agent for initial assessment
and triage, coordinating with specialized agents for in-depth diagnosis in
specific domains. The core innovation lies in our end-to-end knowledge graph
generation methodology, incorporating: (1) semantic-driven entity and relation
extraction optimized for medical terminology, (2) multi-dimensional decision
relationship reconstruction from unstructured medical texts, and (3)
human-guided reasoning for knowledge expansion. KG4Diagnosis serves as an
extensible foundation for specialized medical diagnosis systems, with
capabilities to incorporate new diseases and medical knowledge. The framework's
modular design enables seamless integration of domain-specific enhancements,
making it valuable for developing targeted medical diagnosis systems. We
provide architectural guidelines and protocols to facilitate adoption across
medical contexts.
comment: 10 pages, 5 figures, published to AAAI-25 Bridge Program
♻ ☆ Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
Diffusion models have shown remarkable success in text-to-image generation,
making preference alignment for these models increasingly important. The
preference labels are typically available only at the terminal of denoising
trajectories, which poses challenges in optimizing the intermediate denoising
steps. In this paper, we propose to conduct Denoised Distribution Estimation
(DDE) that explicitly connects intermediate steps to the terminal denoised
distribution. Therefore, preference labels can be used for the entire
trajectory optimization. To this end, we design two estimation strategies for
our DDE. The first is stepwise estimation, which utilizes the conditional
denoised distribution to estimate the model denoised distribution. The second
is single-shot estimation, which converts the model output into the terminal
denoised distribution via DDIM modeling. Analytically and empirically, we
reveal that DDE equipped with two estimation strategies naturally derives a
novel credit assignment scheme that prioritizes optimizing the middle part of
the denoising trajectory. Extensive experiments demonstrate that our approach
achieves superior performance, both quantitatively and qualitatively.
♻ ☆ MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models AAAI-25
Medical Large Language Models (MLLMs) have demonstrated potential in
healthcare applications, yet their propensity for hallucinations -- generating
medically implausible or inaccurate information -- presents substantial risks
to patient care. This paper introduces MedHallBench, a comprehensive benchmark
framework for evaluating and mitigating hallucinations in MLLMs. Our
methodology integrates expert-validated medical case scenarios with established
medical databases to create a robust evaluation dataset. The framework employs
a sophisticated measurement system that combines automated ACHMI (Automatic
Caption Hallucination Measurement in Medical Imaging) scoring with rigorous
clinical expert evaluations and utilizes reinforcement learning methods to
achieve automatic annotation. Through an optimized reinforcement learning from
human feedback (RLHF) training pipeline specifically designed for medical
applications, MedHallBench enables thorough evaluation of MLLMs across diverse
clinical contexts while maintaining stringent accuracy standards. We conducted
comparative experiments involving various models, utilizing the benchmark to
establish a baseline for widely adopted large language models (LLMs). Our
findings indicate that ACHMI provides a more nuanced understanding of the
effects of hallucinations compared to traditional metrics, thereby highlighting
its advantages in hallucination assessment. This research establishes a
foundational framework for enhancing MLLMs' reliability in healthcare settings
and presents actionable strategies for addressing the critical challenge of AI
hallucinations in medical applications.
comment: Published to AAAI-25 Bridge Program
♻ ☆ Conditional diffusions for amortized neural posterior estimation
Neural posterior estimation (NPE), a simulation-based computational approach
for Bayesian inference, has shown great success in approximating complex
posterior distributions. Existing NPE methods typically rely on normalizing
flows, which approximate a distribution by composing many simple, invertible
transformations. But flow-based models, while state of the art for NPE, are
known to suffer from several limitations, including training instability and
sharp trade-offs between representational power and computational cost. In this
work, we demonstrate the effectiveness of conditional diffusions coupled with
high-capacity summary networks for amortized NPE. Conditional diffusions
address many of the challenges faced by flow-based methods. Our results show
that, across a highly varied suite of benchmarking problems for NPE
architectures, diffusions offer improved stability, superior accuracy, and
faster training times, even with simpler, shallower models. Building on prior
work on diffusions for NPE, we show that these gains persist across a variety
of different summary network architectures. Code is available at
https://github.com/TianyuCodings/cDiff.
♻ ☆ SHIP: A Shapelet-based Approach for Interpretable Patient-Ventilator Asynchrony Detection PAKDD 2025
Patient-ventilator asynchrony (PVA) is a common and critical issue during
mechanical ventilation, affecting up to 85% of patients. PVA can result in
clinical complications such as discomfort, sleep disruption, and potentially
more severe conditions like ventilator-induced lung injury and diaphragm
dysfunction. Traditional PVA management, which relies on manual adjustments by
healthcare providers, is often inadequate due to delays and errors. While
various computational methods, including rule-based, statistical, and deep
learning approaches, have been developed to detect PVA events, they face
challenges related to dataset imbalances and lack of interpretability. In this
work, we propose a shapelet-based approach SHIP for PVA detection, utilizing
shapelets - discriminative subsequences in time-series data - to enhance
detection accuracy and interpretability. Our method addresses dataset
imbalances through shapelet-based data augmentation and constructs a shapelet
pool to transform the dataset for more effective classification. The combined
shapelet and statistical features are then used in a classifier to identify PVA
events. Experimental results on medical datasets show that SHIP significantly
improves PVA detection while providing interpretable insights into model
decisions.
comment: Accepted at PAKDD 2025
♻ ☆ Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant
progress in perceiving and reasoning over multimodal inquiries, ushering in a
new research era for foundation models. However, vision-language misalignment
in MLLMs has emerged as a critical challenge, where the textual responses
generated by these models are not factually aligned with the given text-image
inputs. Existing efforts to address vision-language misalignment have focused
on developing specialized vision-language connectors or leveraging visual
instruction tuning from diverse domains. In this paper, we tackle this issue
from a fundamental yet unexplored perspective by revisiting the core
architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs
consisting of a causal attention mechanism, which limits the ability of the
earlier modalities (e.g., images) to incorporate information from the latter
modalities (e.g., text). To address this problem, we propose AKI, a
novel MLLM that unlocks causal attention into modality-mutual attention (MMA)
to enable image tokens to attend to text tokens. This simple yet effective
design allows AKI to achieve superior performance in 12 multimodal
understanding benchmarks (+7.2% on average) without introducing additional
parameters or increasing training time. Our MMA design is intended to be
generic, allowing for application across various modalities, and scalable to
accommodate diverse multimodal scenarios. The code and model are publicly
available at https://github.com/sony/aki to encourage further advancements in
MLLMs across various directions.
comment: Preprint
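The difference between causal attention and modality-mutual attention can be pictured as a change to the attention mask. The sketch below is one plausible reading of the abstract, under the assumption that the only change is letting image tokens see text tokens that come later in the sequence; the function name `mma_mask` and the modality labels are hypothetical and not taken from the AKI code.

```python
def mma_mask(modalities):
    """Boolean attention mask; mask[i][j] is True if token i may attend to token j.

    `modalities` is a list of "image" / "text" labels, one per token.
    """
    n = len(modalities)
    # Start from the standard causal mask: each token sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Additionally let every image token attend to every text token,
    # including text tokens that appear later in the sequence.
    for i in range(n):
        if modalities[i] == "image":
            for j in range(n):
                if modalities[j] == "text":
                    mask[i][j] = True
    return mask

mask = mma_mask(["image", "image", "text", "text"])
```

In this sketch the text rows stay strictly causal, and no weights are added, which matches the abstract's claim that the design introduces no additional parameters.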
♻ ☆ There and Back Again: On the relation between Noise and Image Inversions in Diffusion Models
Diffusion Models achieve state-of-the-art performance in generating new
samples but lack a low-dimensional latent space that encodes the data into
meaningful features. Inversion-based techniques try to solve this issue by
reversing the denoising process and mapping images back to their approximated
starting noise. In this work, we thoroughly analyze this procedure and focus on
the relation between the initial Gaussian noise, the generated samples, and
their corresponding latent encodings obtained through the DDIM inversion.
First, we show that latents exhibit structural patterns in the form of less
diverse noise predicted for smooth image regions. Next, we explain the origin
of this phenomenon, demonstrating that, during the first inversion steps, the
noise prediction error is much more significant for the plain areas than for
the rest of the image. Finally, we present the consequences of the divergence
between latents and noises by showing that the space of image inversions is
notably less manipulable than the original Gaussian noise. This leads to a low
diversity of generated interpolations or edits based on the DDIM inversion
procedure and ill-defined latent-to-image mapping. Code is available at
https://github.com/luk-st/taba.
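For reference, the standard deterministic DDIM update and its inversion can be written as follows. The notation (cumulative schedule \bar\alpha_t, noise predictor \epsilon_\theta) is the usual DDIM convention and is an assumption here, since the abstract does not fix symbols.

```latex
% Deterministic DDIM denoising step (eta = 0)
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,
          \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
        + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)

% DDIM inversion step: the same update run toward higher noise,
% reusing \epsilon_\theta(x_t, t) in place of the unavailable \epsilon_\theta(x_{t+1}, t+1)
x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,
          \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
        + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t)
```

The substitution in the second equation is where an approximation error enters, which is consistent with the abstract's observation that noise prediction error during the first inversion steps shapes the resulting latents.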
♻ ☆ Accelerating Flood Warnings by 10 Hours: The Power of River Network Topology in AI-enhanced Flood Forecasting
Climate change-driven floods demand advanced forecasting models, yet Graph
Neural Networks (GNNs) underutilize river network topology due to tree-like
structures causing over-squashing from high node resistance distances. This
study identifies this limitation and introduces a reachability-based graph
transformation to densify topological connections, reducing resistance
distances. Empirical tests show transformed-GNNs outperform EA-LSTM in extreme
flood prediction, achieving 24-h water level accuracy equivalent to EA-LSTM's
14-h forecasts - a 71% improvement in long-term predictive horizon. The dense
graph retains flow dynamics across hierarchical river branches, enabling GNNs
to capture distal node interactions critical for rare flood events. This
topological innovation bridges the gap between river network structure and GNN
modeling, offering a scalable framework for early warning systems.
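A reachability-based densification of a tree-like river network can be sketched as follows. The abstract does not give the exact construction, so this assumes the simplest reading: connect each gauge node directly to every node downstream of it. The function name `reachability_edges` and the toy network are hypothetical.

```python
def reachability_edges(downstream):
    """Return directed edges (u, v) for every v reachable downstream of u.

    `downstream` maps each node to the next node downstream,
    or to None at the river outlet.
    """
    edges = set()
    for node in downstream:
        nxt = downstream[node]
        # Walk to the outlet, linking `node` to every node passed on the way.
        while nxt is not None:
            edges.add((node, nxt))
            nxt = downstream[nxt]
    return edges

# Toy network: tributaries a and b join at c, which drains to outlet d.
river = {"a": "c", "b": "c", "c": "d", "d": None}
dense = reachability_edges(river)
original_edge_count = sum(1 for v in river.values() if v is not None)
```

Here the headwater nodes gain a direct edge to the outlet, so information no longer has to pass through every intermediate node, which is the intuition behind reducing resistance distances.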
♻ ☆ Non-autoregressive Sequence-to-Sequence Vision-Language Models CVPR 2024
Sequence-to-sequence vision-language models are showing promise, but their
applicability is limited by their inference latency due to their autoregressive
way of generating predictions. We propose a parallel decoding
sequence-to-sequence vision-language model, trained with a Query-CTC loss, that
marginalizes over multiple inference paths in the decoder. This allows us to
model the joint distribution of tokens, rather than restricting to conditional
distribution as in an autoregressive model. The resulting model, NARVL,
achieves performance on par with its state-of-the-art autoregressive
counterpart, but is faster at inference time, reducing from the linear
complexity associated with the sequential generation of tokens to a paradigm of
constant time joint inference.
comment: Accepted to CVPR 2024
♻ ☆ Toward an Evaluation Science for Generative AI Systems
Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, William Isaac
There is an increasing imperative to anticipate and understand the
performance and safety of generative AI systems in real-world deployment
contexts. However, the current evaluation ecosystem is insufficient: Commonly
used static benchmarks face validity challenges, and ad hoc case-by-case audits
rarely scale. In this piece, we advocate for maturing an evaluation science for
generative AI systems. While generative AI creates unique challenges for system
safety engineering and measurement science, the field can draw valuable
insights from the development of safety evaluation practices in other fields,
including transportation, aerospace, and pharmaceutical engineering. In
particular, we present three key lessons: Evaluation metrics must be applicable
to real-world performance, metrics must be iteratively refined, and evaluation
institutions and norms must be established. Applying these insights, we outline
a concrete path toward a more rigorous approach for evaluating generative AI
systems.
comment: First two authors contributed equally to this work